The invention is in the field of finding diagnostic assays for serious illnesses. In particular, it concerns a new marker that can be useful in diagnosing amyotrophic lateral sclerosis (ALS) and methods to detect ALS and presymptomatic Alzheimer's disease (PSAD).
More than 5 million people in the US are currently living with Alzheimer's disease (AD). There is currently no cure or good treatment for AD, but early detection and management of the disease lead to reduced treatment cost and higher quality of life. Treatment of patients who are presymptomatic or have mild cognitive impairment (MCI), a condition that precedes the dementia characteristic of AD, can result in at least measured success. Use of therapeutics with a focus on treating presymptomatic AD (PSAD) is consistent with the fact that irreversible neuronal damage is detectable years to decades before onset of MCI. There is a critical need for reliable, low-cost non-invasive biomarkers of PSAD (for both early detection in the clinic and for drug efficacy testing by pharmaceutical companies); however, existing assays for direct detection of PSAD from serum remain unreliable despite many years of investigation.
Another problematic neurodegenerative disease is amyotrophic lateral sclerosis (ALS). ALS is extremely debilitating and can lead to weakness, paralysis and, ultimately, death. It is also known as Lou Gehrig's disease. The current state of diagnosis is complex and there are no known markers that are reliable for providing a useful diagnosis.
It is known that defects in the gene encoding TDP-43 can lead to ALS, and that misfolded TDP-43 is a major constituent of protein aggregates in many patients with ALS regardless of whether a mutation exists in this gene. (TAR DNA-binding protein 43 (TDP-43) is a transactive response DNA-binding protein with a molecular weight of 43 kD. It is a cellular protein which in humans is encoded by the TARDBP gene.) It is also known that TDP-43 aggregation is at first localized, but then spreads to neighboring unaffected neurons, leading to more severe and widespread symptoms. One approach to halting disease progression is to stop the spread of protein aggregation that is transmitted from one cell to another, but the mechanism of spreading is not understood. One potential contributor to such spreading is a signaling molecule called casein kinase 1 gamma 2 (CK1γ2). Changes in this protein are one aspect of the present invention.
During ALS progression, there is an ordered spread of weakness and loss of motor control from point of onset to other regions in a spatiotemporal manner, suggesting the existence of soluble factors that can spread disease between cells. Consistent with this, in vitro models of ALS show that serum or cerebrospinal fluid from patients with ALS results in increased neuronal death. In addition, glial cells can spread toxicity to motor neurons in mice and in cell culture. These data demonstrate that ALS pathology can be spread from serum to cells, so that exposing cultured cells to serum is indicated as a method to identify and characterize cellular responses to signals of disease. As noted above, a proposed mechanism for the spread of disease to unaffected cells is the transfer of misfolded proteins from one cell to another, and conversion of normally folded proteins in the new cell into the aberrant conformation by a prion-like mechanism (Polymenidou M., et al., Cell (2011) 147:498-508). Misfolded proteins in ALS patients include SOD1, TDP-43 and FUS, and there is evidence for SOD1 acting as a template in this way, but evidence for the other proteins is lacking. Data showing that motor neuron toxicity in one system was mediated through glial SOD1 synthesis suggest that ALS can spread from one cell to another in a SOD1-dependent manner and that prion-like spreading is a plausible explanation. However, detection of toxicity transferred from human astrocytes to mouse motor neurons suggests the existence of a second mechanism (as human SOD1 is not a substrate that can seed mouse SOD1 aggregation). The present invention, in one aspect, concerns a novel second mechanism of ALS transmission between cells that is distinct from the prion model.
In relation to the foregoing, a hyperphosphorylated, ubiquitinated and cleaved form of TDP-43 (known as pathological TDP-43) is a major disease protein in ALS. Hyperphosphorylated TDP-43 is a major component of intranuclear and cytoplasmic inclusions deposited in brains of patients with ALS, which colocalize with stress granules. There are data in the art that suggest that a CK1 isoform may be involved in TDP-43 aggregation (Hasegawa, M., et al., Annals of Neurology (2008) 64:60-70; Inukai, Y., et al., FEBS Lett. (2008) 582:2899-2904; and Kametani, F., et al., Biochem. Biophys. Res. Comm. (2009) 382:405-409). These data include experiments with a truncated version of CK1δ with the C-terminal region deleted. This protein is called CK1 because it is missing the C-terminal region where the six CK1 isoforms (α, δ, ε, γ1, γ2, γ3) are most divergent. CK1 strongly phosphorylates TDP-43 in vitro, whereas phosphorylation by other kinases (CK2 or GSK3) is much weaker or was not detected. In addition, electrophoretic mobility shift of CK1-modified TDP-43 is similar to that of hyperphosphorylated TDP-43 associated with ALS in vitro.
Among 28 ALS-related mutations in TDP-43 (including pathologic mutations in familial cases and variants found in sporadic cases), all but one are in the C-terminal Gly-rich region (273-414) which is in the region hyperphosphorylated by CK1 (containing 18 of 29 mapped phosphorylation sites), and this region is required for TDP-43 aggregation and cellular toxicity in vivo. Together these data suggest a role for CK1 in TDP-43 phosphorylation and possibly aggregation, but they do not link CK1 to ALS. It is not known if CK1 activity on TDP-43 is activated by ALS progression, or which of the six isoforms is involved in TDP-43 phosphorylation. The invention, in one aspect, sheds light on these matters.
In one aspect, the invention is directed to a method to determine the probability that a test subject is afflicted with amyotrophic lateral sclerosis (ALS) which method comprises contacting a biological fluid of said test subject with indicator cells and assessing said indicator cells for the level of expression of an exon of CK1γ2 that encodes the C-terminal palmitoylated region of said CK1γ2 whereby a diminished level of expression of this exon as compared to its expression level in said indicator cells when contacted with biological fluid of normal subjects indicates a high probability that said test subject is afflicted with ALS.
In another aspect, the invention is directed to a method to determine the probability of the presence of ALS in a test subject which method comprises using an indicator cell assay platform (iCAP) by contacting indicator cells that are motor neurons derived from stem cells with a biological fluid of said test subject and comparing the expression pattern in said indicator cells to that obtained when said cells are contacted with a biological fluid from normal subjects.
In another aspect, the invention is directed to a method to determine the probability of the presence of presymptomatic or symptomatic Alzheimer's disease (PSAD) in a test subject which method comprises using an indicator cell assay platform (iCAP) by contacting indicator cells that are pan neuronal populations of glutamatergic (and GABAergic) neurons with biological fluid of said test subject and comparing the expression pattern in said indicator cells to that obtained when said cells are contacted with biological fluid from normal subjects.
The platform iCAP is subject to a number of assay formats, but typically, the assays for expression in indicator cells are conducted by extracting mRNA, optionally obtaining corresponding cDNA, and then assessing the levels of the mRNA and/or cDNA using complementary probes thereto. Expression levels of specific genes are particularly useful in all of these determinations.
U.S. patent publication US2012/0245048, the contents of which are incorporated herein by reference, describes an assay designed to detect the presence of ALS by assessing the biological fluid of a test subject for markers that result from treating said biological fluid with spinal motor neurons derived from HGB3 embryonic stem cells. Using this assay, it is found that, as shown in the examples below, the CK1γ2 transcript showed reduced expression of the exon encoding the small C-terminal regulatory region of CK1γ2 which is both palmitoylated and phosphorylated.
Palmitoylation of CK1γ (a closely related Xenopus isoform of CK1γ2) facilitates targeting and tethering of the kinase to the plasma membrane where it is localized under normal conditions. Failure of the mouse exon to be fully expressed should therefore result in a reduction in the amount of protein that is tethered to the plasma membrane and an increase in the cytoplasmic pool (as has been observed for CK1γ truncations in Xenopus). These data indicate that in the cytoplasm, CK1γ2 can propagate ALS pathology by phosphorylation of TDP-43 (as has been shown for CK1 in vitro). As noted above, hyperphosphorylation of TDP-43 is characteristic of ALS. Thus, the underexpression of this exon results in a known factor that propagates ALS. One method for ascertaining the expression of the exon is to assess the localization of CK1γ2 in the cytoplasm of indicator cells.
While use of motor neurons as indicator (responder) cells is contraindicated in the case of Alzheimer's diagnosis, the general approach for detecting ALS is a good surrogate for AD or PSAD since both are neurodegenerative diseases with common underlying pathologies; both are caused by late onset protein misfolding and toxic aggregation, and involve common cellular processes including the ubiquitin-proteasome, programmed cell death, ROS overproduction, and dysfunctional mitochondria and axonal transport (Jellinger, K. A., J. Cell. Mol. Med. (2010) 14:457-487; Jellinger, K. A., J. Neural Transm. (2009) 116:1111-1162; Federico, A. et al., J. Neurol. Sci. (2012) 322:254-262).
A common emphasis on exons results in a determination of splicing as a differential in disease states as compared to normals. Splicing affects about 80% of human genes and aberrant alternative splicing is already linked to neurodegenerative disease and related cellular dysfunctions including proteasome inhibition and oxidative stress. Splice variants specific for AD and Parkinson's disease have been identified in blood (Potashkin, J. A., et al., PLoS One (2012) 7:e43595 and Fehlbaum-Beurdeley, P., et al., J. Alzheimer's Assoc. (2010) 6:25-38). Splicing can be identified by within-sample comparisons, thus diminishing technical error due to between-sample comparisons.
An emphasis on pathways (gene sets) results in determination of gene set enrichment as a differential in disease state as compared to normals. This approach measures expression of gene sets (genes involved in a common cellular pathway or sharing another annotation) instead of individual genes, effectively reducing the number of features considered and identifying statistically significant differential expression of some genes that would otherwise go unnoticed due to noise in the measurement (Subramanian, A., et al., PNAS (2005) 102:15545-15550).
Using pan neuronal glutamatergic (mixed with GABAergic) cells as responders to compare early stage AD plasma samples (post-MCI) to those from cognitively normal subjects (4 replicates of each) (for exon level analysis without disease classification), a t-test was performed (without multiple testing correction) and 2,537 exons were significantly differentially spliced (p-value <0.05). A power calculation was performed suggesting that a significant differential response signature of ~1000 exons can be generated using data from 20 paired disease/normal experiments.
The assays of the invention can use blood, including serum, and cerebrospinal fluid (CSF) samples which could be run concomitantly. In some assays, the responder cells are grown for 5 days to a steady level of responsiveness and exposed to CSF or serum or other bodily fluid for 24 hours. Transcriptome profiles can be analyzed using Affymetrix® human exon assays.
For using an iCAP to classify the disease state of new subjects, differential gene expression profiles can be used to train a disease classifier to classify new subjects based on their expression profile in the same cell based assay. This can involve first selecting a subset of features (genes, gene sets or exons) that are differentially expressed in the iCAP signatures of disease versus normal subjects using a machine-learning feature selection tool like mProbes (Huynh-Thu, V. A. et al., Bioinformatics (2012) 28:1766-1774), and next training and testing a disease classifier using machine-learning approaches like support vector machines (SVM; Furey, T. S. et al., Bioinformatics (2000) 16:906-914; Brown, M. P. et al., PNAS (2000) 97:262-267).
While a wide variety of assay formats for expression is available, in the examples below, expression levels are determined by obtaining mRNA from the indicator cells, optionally preparing complementary DNA corresponding to the mRNA extracted and assessing the mRNA and/or cDNA for binding to complementary probes. It is possible to assess multiple mRNA and/or cDNA levels at once using arrays of probes, many of which are commercially available.
Further, in the examples below, in addition to the specific detection of expression of the C-terminal palmitoylated region of CK1γ2 for ALS, an overall expression pattern can be obtained for diagnosis both of ALS and symptomatic and presymptomatic AD. In the examples below, specific genes that are over- or under-expressed in the presence of these abnormal conditions when biological fluid from a test subject is contacted with the indicator cells are disclosed. In the case of ALS, murine subjects and indicator cells were used and the genes represented in the array represent murine genes. The method is equally applicable to the ortholog genes in humans and other species. Thus, the methods of the claims are applicable to test samples from any subject susceptible to ALS including mammals in general and especially humans. The illustrative work with regard to AD in Example 2, however, specifies human genes.
The number of genes whose expression levels are to be tested is subject to the judgment of the practitioner. As few as two or as many as 50 or more may be determined simultaneously to obtain a pattern. Thus, one could choose to detect expression levels of, for example, 5, 10, 20, 30, 40, 50 or 100 genes. In the case of ALS, all of the more than 400 specified genes may be assessed. These ranges are intended to include all intervening integers; rather than taking up space to articulate each integer specifically, the inclusion of intermediate values is simply noted here.
The following examples are intended to illustrate but not limit the invention.
The ALS signature in serum of mice developing ALS was determined using motor neurons as detector cells as described in US2012/0245048. Motor neurons have been shown to be targeted by the disease in a non-cell-autonomous manner (Nagai, M., et al., Nature Neuroscience (2007) 10:615-622), and therefore are responsive to disease-specific signatures in serum.
In one experiment, as set forth in the above-mentioned publication, disease serum was taken from 5 transgenic ALS susceptible mice (SOD1; G93A) at 9 weeks of age and control serum was taken from 5 non-carrier mice of the same age from the same colony.
Spinal motor neurons (MNs) were derived from HGB3 embryonic stem cells expressing a fluorescently labeled motor neuron marker (HB9-eGFP) by a previously described method (Wichterle, H., et al., Cell (2002) 110:385-397), as set out below. Unless otherwise specified, growth of ES cells was in differentiation medium (consisting of equal parts Advanced™ DMEM/F12 (Invitrogen) and Neurobasal™ medium (Invitrogen) supplemented with penicillin/streptomycin, 2 mM L-Glutamine, 0.1 mM 2-mercaptoethanol, and 10% KnockOut™ serum replacement (Invitrogen)). ES cells were plated at ~10⁵ cells per mL and grown in aggregate culture for 2 days to form embryoid bodies (EBs) in a 10 cm² dish. EBs were split 1:4 into four 10 cm² dishes and exposed to 1 μM each of retinoic acid and sonic hedgehog agonist (Hh-Ag1.3, Curis, Inc.) for two days, to caudalize spinal character and ventralize into MN progenitors, respectively. Medium was changed and EBs were grown for an additional 3 days in differentiation medium to generate MNs. Two dishes of EBs were pooled, washed with PBS and resuspended in 1 mL of differentiation medium. 100 μL of these EBs were inoculated in each of 10 wells of a 3.8 cm² 12-well dish. EBs were incubated for 24 h in 2 mL differentiation medium containing either 5% serum from 9-week-old ALS susceptible mice or 5% serum from normal mice. Each experiment (disease or control) was done five times with serum from five different mice.
RNA was isolated using TRIzol® reagent, and cDNA was synthesized from polyA RNA, labeled and hybridized to Affymetrix® GeneChip® mouse exon arrays according to manufacturer's recommendations.
Probe intensities for ten experiments (five replicates each of control and disease serum) were normalized together and data from probes representing a continuous stretch of putatively transcribed genomic sequence were merged into probe sets (using RMA algorithm of the Affymetrix® Expression Console software). Two filters were applied to exclude probe sets that did not meet the criteria below: 1. Probe sets map to the genome and thus levels are annotated as “core”, “full”, “free” or “extended” by Affymetrix®. 2. Probe sets have high confidence of detection over background in at least 5 of the 10 experiments (P<0.001 determined using the DABG algorithm of the software). After application of these two filters, the data set consisted of 135,181 probe sets.
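The second filter above can be sketched in a few lines. This is only an illustration of the stated rule (detection P<0.001 in at least 5 of the 10 experiments), not the Expression Console implementation; the function name `passes_detection_filter`, the probe set names and the p-values are hypothetical.

```python
def passes_detection_filter(dabg_pvalues, p_cutoff=0.001, min_detected=5):
    """Return True if a probe set is detected above background often enough.

    dabg_pvalues: one DABG p-value per experiment (10 experiments in the text).
    """
    detected = sum(1 for p in dabg_pvalues if p < p_cutoff)
    return detected >= min_detected


# Hypothetical DABG p-values for two probe sets across the 10 experiments.
example = {
    "probe_set_A": [0.0001] * 6 + [0.2] * 4,  # detected in 6 of 10 experiments
    "probe_set_B": [0.0001] * 3 + [0.2] * 7,  # detected in 3 of 10 experiments
}

kept = [name for name, pvals in example.items() if passes_detection_filter(pvals)]
print(kept)
```

In this toy input only the first probe set meets the threshold and would be carried forward.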
Probe-level expression values were analyzed for significant differential expression between cells exposed to control serum and those exposed to disease serum using Significance Analysis of Microarrays (SAM) in the MeV component of the TM4 microarray software (by running a two-class paired analysis using default parameters and the 32 possible unique permutations of the data to calculate the statistic). This analysis generated an ALS disease signature consisting of 441 probe sets that significantly increased in expression in response to disease serum compared to normal serum with q-values and false discovery rates <15%.
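The figure of 32 permutations follows from the paired design: with five disease/control pairs, each paired difference can keep or flip its sign, giving 2⁵ = 32 assignments. The sketch below illustrates this for a single probe set using the mean paired difference as the statistic. It is an assumption-laden simplification: SAM itself uses a moderated statistic and derives q-values across all probe sets, and the function name and intensity values here are hypothetical.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_pvalue(control, disease):
    """Two-sided permutation p-value for paired data using all sign flips."""
    diffs = [d - c for c, d in zip(control, disease)]
    observed = abs(mean(diffs))
    # Every way of keeping (+1) or flipping (-1) the sign of each paired difference.
    flips = list(product((1, -1), repeat=len(diffs)))
    hits = sum(
        1
        for signs in flips
        if abs(mean(s * x for s, x in zip(signs, diffs))) >= observed
    )
    return len(flips), hits / len(flips)


# Hypothetical log2 intensities for one probe set in five paired experiments.
control = [7.1, 6.9, 7.3, 7.0, 7.2]
disease = [8.0, 7.9, 8.4, 8.1, 8.0]

n_permutations, p_value = paired_sign_flip_pvalue(control, disease)
print(n_permutations, p_value)
```

With only five pairs, the smallest two-sided p-value this scheme can return for a single probe set is 2/32, which is one reason significance in such designs is usually assessed across many probe sets through a false discovery rate.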
The high level of resolution of the above exon arrays was exploited in an analysis of differential splicing of mRNA in response to pre-symptomatic ALS mouse serum (versus normal mouse serum) using FIRMA software (Purdom, E., et al., Bioinformatics (2008) 24:1707-1714). The comparison of genes together within the same sample makes the tests invariant to all forms of data normalization that do not affect within-sample quantification. For this analysis, additional data were generated resulting in a total of 41 datasets (including responses to serum from presymptomatic ALS mice (N=20) and age-matched normal mice (N=21)). Next, splice variants were identified and used to find disease-specific differentially expressed exons. Next, exons were ranked by magnitude of differential splicing and disease classification was performed in two steps: 1) Ranked exons were used to build and train an ensemble of classifiers using only half of the samples (11 ALS and 12 normal). The ensemble predicted the remaining 18 independent samples, revealing the classifier accuracy as 82% (p-value <0.001). 2) The top 100 ranked exons from 1) were used to train and test a new classifier using all of the samples. Leave-one-out cross validation predicts classifier accuracy of 78% (p-value <0.0001).
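Leave-one-out cross validation, as used in step 2, holds out each sample in turn, trains on the rest, and scores the held-out prediction. The sketch below shows the loop with a nearest-centroid classifier standing in for the classifier of the text, whose details are not given here; the function names and the toy data are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(centroids, key=lambda lab: squared_distance(centroids[lab], x))


def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the others, and score the prediction."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Toy, well-separated data: two exon-level features per sample (hypothetical values).
xs = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1], [1.0, 1.1], [0.9, 1.2], [1.1, 0.9]]
ys = ["normal", "normal", "normal", "ALS", "ALS", "ALS"]

accuracy = leave_one_out_accuracy(xs, ys)
print(accuracy)
```

For 41 samples, as in the text, the same loop would run 41 times, each time training on the other 40.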
CK1γ2, the top ranked significantly differentially spliced gene in the disease signature, was further characterized to predict its involvement in a cellular response to presymptomatic ALS serum. Differential splicing was analyzed, whereby average intensities for all probe sets within the putative CK1γ2 transcript (supported by RefSeq and full-length mRNA GenBank records) are shown in
The sequences used in the foregoing assay are as follows:
Next an iCAP-based classifier was developed for ALS detection from serum using the same cell-based assay except with analysis of gene-level and exon-level expression data. For this analysis, additional data were generated resulting in a total of 47 datasets (including data using serum from presymptomatic ALS mice (N=23) and age-matched normal mice (N=24)).
Data were merged and two filters were applied to exclude probe sets that did not map to a gene, and probe sets that did not have high confidence of detection over background in at least one experiment (P<0.01 determined using the DABG algorithm of the software).
All data were co-normalized (Purdom, E. et al., Bioinformatics (2008) 24:1707-1714), and half of the data (12 of control class and 11 of disease class) were used to build a disease classifier. To do this, three feature types were analyzed for significant differential enrichment between the classes including splice variants (Purdom, E., et al., Bioinformatics (2008) 24:1707-1714; Irizarry, R. A., et al., Nucleic Acids Res. (2003) 31:e15; Irizarry, R. A., et al., Biostatistics (2003) 4:249-264), genes and pathways (Efron, B., et al., The Annals of Applied Statistics (2007) 1:107-129). Pathways are sets of genes that share a common annotation, including those from GO, KEGG and REACTOME, and were used as features in an attempt to capture complex interactions between variables.
Next, features were selected by ranking (based on magnitude and significance scores) and using mProbes, a machine-learning feature selection tool that uses artificially generated random features to generate a noise model (Huynh-Thu, V. A. et al., Bioinformatics (2012) 28:1766-1774), to select top features that rise above the noise for classification (FDR <100% or other metrics).
Sets of selected features were used to build and train disease classifiers using Support Vector Machines (SVM) with polynomial kernels (an approach that performs well with the large number of features of gene expression datasets) (Furey, T. S., et al., Bioinformatics (2000) 16:906-914; Brown, M. P., et al., PNAS (2000) 97:262-267), or an ensemble of this SVM with random forest (Breiman, L., Machine Learning (2001) 45:5-32), evolutionary tree and naïve Bayes classifiers. All classifiers were tested by predicting the remaining 24 independent blind samples (12 of each class).
Top classifier performance was observed for iterations using pathway features (absolute GSA scores ≧1) and SVM classification (accuracies of 83-96%). Iterations using pathway features with other classifiers were not as accurate, but performed significantly better than random. To evaluate classifier robustness, one method was selected (SVM classification using mProbes-selected pathway features (absolute GSA scores ≧1 and FDR<100%)) and the analysis was repeated with 24 subsets of the training data (each with one feature removed). Each classifier was made up of ~60 pathway features (representing ~430 genes). The classifiers performed well with a top classifier accuracy of 96% and correlation coefficient of 0.92 (
Significantly differentially expressed features of the iCAP reflect known aspects of ALS: 1) Gene pathways include the ER stress response mediated by PERK (and transcription factors (TFs), ATF4 and CHOP) (Han, J., et al., Nature Cell Biology (2013) 15:481-490), an early pathological event in ALS (Saxena, S. and Caroni, P., Neuron (2011) 71:35-48) and 2) Gene list includes ATF4 and CHOP (Ddit3) and is enriched for their known targets (Han, J., et al., Nature Cell Biology (2013) 15:481-490). Genes are also significantly enriched for those specifically expressed in microdissected neurons from presymptomatic SOD1 ALS mice (Lobsiger, et al., PNAS (2007) 104:7319-7326; Ferraiuolo, L., et al., J. Neuroscience (2007) 27:9201-9219; Perrin, F. E., et al., Human molecular genetics (2005) 14:3309-3320).
These data establish feasibility of developing a robust iCAP-based classifier for detection of presymptomatic ALS using human serum. In addition to disease classification, the assay may have other utility; significantly differentially expressed features of the iCAP are enriched for genes and processes that have been implicated in ALS, suggesting that the assay may also have utility for understanding disease mechanism and identifying candidate therapeutic targets.
The genes in the pathways used to train the classifier with the top performance (SVM classification of mProbes-selected pathway features (absolute GSA≧1 and FDR<100%)) are listed below:
A mix of iPSC-derived glutamatergic and GABAergic neurons (from Cellular Dynamics International) was plated in a 12-well dish (at 600,000 cells/well) and cultured for 5 days. Cells were then exposed to 5% plasma from 4 cognitively normal controls and 4 patients with confirmed mild cognitive impairment (MCI) for 24 h, and RNA was isolated and used for gene expression analysis using Affymetrix® human exon arrays (ST 1.0). The data were merged, normalized, and filtered to include only ~207,000 of the ~1.4 M exons on the array that were significantly detected above background (DABG <0.01) for either all of the normal or all of the early symptomatic AD (PSAD) experiments. A t-test was performed on individual exons (i.e., without multiple test correction) and revealed significant differential splicing of 2,537 exons (p-value <0.05) in response to early symptomatic AD versus normal plasma.
The exons in the disease signature correspond to 2,234 genes. Because AD pathogenesis is strongly linked to production and deposition of the beta amyloid peptide, these genes were analyzed for enrichment of the NCBI gene description term “amyloid beta” as a preliminary analysis of AD relatedness. The genes in the preliminary disease signature were significantly enriched for the term “amyloid beta” when compared to all expressed genes on the array (HGD p-value <0.05).
These data formed the basis of a power analysis to estimate the number of experiments needed to obtain significant differential gene splicing between normal and PSAD serum samples in the iCAP (using a t-test with an FDR threshold of 0.05 and a Beta of 0.05). The analysis estimated that performing 20 paired disease/normal experiments would yield a signature made up of ~1000 significantly differentially spliced exons (see
To perform this analysis, the fraction of all transcripts that are expected to be significant from the preliminary AD analysis was calculated. The power.t.test.FDR function in the [R] ‘ssize’ (Warnes, G. R., et al., (2012) “ssize: Estimate Microarray Sample Size”. R package version 1.32.0) toolbox was used to get a false discovery rate (FDR) power analysis estimate for these 2,537 exons. The FDR threshold was set to 0.05, the power to 0.95, and the expected fraction of significant exons to ˜0.002 (i.e., 2,537/1,432,336) to calculate the total number of paired AD/normal experiments needed to reach statistical significance after FDR correction (Note: larger fractions, such as those that use 207,789 instead of 1,432,336 would result in smaller numbers of experiments). As shown the results range from 5 experiments (i.e., one additional AD and one additional normal experiment) for one exon to 32 experiments (i.e., 28 additional AD and 28 additional normal experiments) for all 2,537 exons.
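The arithmetic quoted in this paragraph can be checked directly. The short sketch below only reproduces the fractions and the counts of additional experiments from the numbers given in the text; it does not reimplement the `ssize` power calculation, and the variable names are illustrative.

```python
significant_exons = 2_537
exons_on_array = 1_432_336
exons_detected = 207_789  # exons passing the detection filter

# Expected fraction of significant exons under the two possible denominators.
fraction_of_array = significant_exons / exons_on_array
fraction_of_detected = significant_exons / exons_detected

# Four paired AD/normal experiments already exist, so the additional number
# needed is the estimated total minus four.
existing_pairs = 4
additional_for_one_exon = 5 - existing_pairs
additional_for_all_exons = 32 - existing_pairs

print(round(fraction_of_array, 4), round(fraction_of_detected, 4))
print(additional_for_one_exon, additional_for_all_exons)
```

The fraction based on detected exons is roughly seven times larger than the fraction based on all exons on the array, which is why the text notes that it would lead to a smaller required number of experiments.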
Next, the iCAP was used to train and test a disease classifier for presymptomatic AD. To do this, the assay was repeated with plasma samples from three classes of patients: 1) pre-MCI (cognitively normal patients with AD biomarkers present in CSF), 2) MCI/early AD (patients with mild cognitive impairment (MCI) (Rosen, C., et al., Mol. Neurodegener (2013) 8:20) or early AD), and 3) healthy controls (cognitively normal patients with AD biomarkers not present in CSF).
The data for 15 samples of each class were merged and normalized (Purdom, E., et al., Bioinformatics (2008) 24:1707-1714). Three feature types were analyzed for significant differential enrichment between the classes including genes, splice variants, and pathways (as was done for the ALS iCAP described in Example 1).
Significant differential expression of pathways is reflected by gene set enrichment (GSE) scores calculated using the GSA algorithm (Efron, B. and Tibshirani, R., The Annals of Applied Statistics (2007) 1:107-129). GSE scores with absolute values greater than 1 were considered significantly differentially expressed. Of the total 9633 pathways, 368 were significantly differentially expressed for pre-MCI versus normal samples and 526 were significantly differentially expressed for MCI/early AD versus normal samples. Comparison of these two pathway sets showed a statistically significant overlap of 205 pathways (hypergeometric distribution probability of 1×10⁻¹⁷⁷) and these pathways showed either increased or decreased expression in response to disease in both datasets. These data suggest that human blood will be a viable source of AD-specific factors that are detectable using the iCAP, and that data from later-stage patients can be used to build classifiers for early-stage AD.
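The significance of the overlap between the two pathway sets can be illustrated with a hypergeometric calculation on the counts given above (9633 pathways in total, sets of 368 and 526, overlap of 205). The sketch below computes an upper-tail probability with exact integer arithmetic; whether the value reported in the text is a tail or a point probability is not stated, so this is an assumption, and the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` items are taken without replacement
    from `population` items of which `successes` are marked."""
    upper = min(successes, draws)
    numerator = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return float(Fraction(numerator, comb(population, draws)))


# Counts taken from the text.
total_pathways = 9633
mci_early_ad_set = 526
pre_mci_set = 368
overlap = 205

expected_overlap = pre_mci_set * mci_early_ad_set / total_pathways
p_overlap = hypergeom_upper_tail(total_pathways, mci_early_ad_set, pre_mci_set, overlap)
print(expected_overlap, p_overlap)
```

Under random selection the expected overlap is about 20 pathways, so an observed overlap of 205 lies far in the tail of the distribution.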
The gene expression data were used to generate a preliminary disease classifier for AD. To do this, first pre-MCI and MCI/early AD disease samples (30 total) were grouped for comparison against normal samples (15 samples up-sampled to 30).
Next, the top differentially expressed genes between disease and normal samples were selected (from ~20,000 genes) using three criteria: significance of differential gene expression (t-test p-value), magnitude of differential gene expression (fold change ratio), and significance of differential expression of pathways associated with each gene (pathways were gene sets selected using the GSA algorithm; Efron, B. and Tibshirani, R., The Annals of Applied Statistics (2007) 1:107-129).
Next, an approach was used to find the optimal number of features to build the classifier. This was done by generating various subsets of the top-ranked features, and selecting the smallest subset that maximized the number of informative features for classification (evaluated using a random forest feature selection tool of mProbes; Huynh-Thu, V. A. et al., Bioinformatics (2012) 28:1766-1774). Using this approach, a random forest classifier was trained using the top 500 features.
The classifier was validated against 20 new blind samples that were independent from the samples used to train the classifier. The blind predictive accuracy of the classifier was tested on various subsets of the top ranked genes. Including between 50 and 500 genes results in a classifier accuracy between 75% and 80%.
The top-ranked 50 features used to build the AD iCAP classifier are listed below. APOE, a gene with a variant that is the largest known genetic risk factor for late-onset sporadic Alzheimer's disease in several ethnic groups (Sadigh-Eteghad, S. et al., Neurosciences (Riyadh) (2012) 17:321-326), is ranked third.
A test was done on the 500 genes used to build the classifier to predict which genes are most informative to the classifier. This was done by measuring decrease in random forest classifier accuracy when the labels for that feature are shuffled. The top-ranked 50 most informative genes that were not already listed above are shown below:
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US14/57530 | 9/25/2014 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
61882547 | Sep 2013 | US |