Autism Spectrum Disorders (ASD) cover a broad spectrum of neurocognitive and social developmental delays with typical onset before 3 years of age, including Autistic Disorder, Pervasive Developmental Disorder—Not Otherwise Specified, and Asperger's Disorder, as subclassified in the Diagnostic and Statistical Manual of Mental Disorders, 4th edition, Text Revision (DSM-IV-TR). The prevalence of ASD has been increasing during recent decades, and current estimates range from 3.7 in 1000 to 1 in 91. Most centers with expertise have waiting lists for evaluation, and despite the progress made in adopting instruments such as the Autism Diagnostic Interview-Revised (ADI-R) and the Autism Diagnostic Observation Schedule (ADOS), there remains significant debate regarding the prognostic value and accuracy of existing instruments.
It has been discovered that a variety of genes are differentially expressed in individuals having autism spectrum disorder compared with individuals free of autism spectrum disorder. Such genes are identified herein as “autism spectrum disorder-associated genes”. It has also been discovered that the autism spectrum disorder status of an individual can be classified with a high degree of accuracy, sensitivity, and/or specificity based on expression levels of these autism spectrum disorder-associated genes. Accordingly, methods and related kits are provided herein for characterizing and/or diagnosing autism spectrum disorder in an individual. In some embodiments, methods are provided for subclassifying individuals by molecular endophenotypes (e.g., gene expression profiles).
According to some aspects of the invention, methods are provided for characterizing the autism spectrum disorder status of an individual in need thereof. In some embodiments, the methods involve subjecting a clinical sample obtained from the individual to a gene expression analysis, in which the gene expression analysis comprises determining expression levels of a plurality of autism spectrum disorder-associated genes in the clinical sample using an expression level determining system. In some embodiments, the methods further involve determining the autism spectrum disorder status of the individual based on the expression levels of the plurality of autism spectrum disorder-associated genes. In some embodiments, the methods further involve a step of obtaining the clinical sample from the individual. In some embodiments, the methods further involve a step of diagnosing autism spectrum disorder in the individual based on the autism spectrum disorder status. In some embodiments, the clinical sample is a sample of peripheral blood, brain tissue, or spinal fluid.
In some embodiments, methods are provided that involve applying an autism spectrum disorder-classifier to autism spectrum disorder gene expression levels to determine the autism spectrum disorder status of the individual. For example, according to some aspects of the invention, methods of characterizing the autism spectrum disorder status in an individual in need thereof are provided that involve (a) subjecting a clinical sample obtained from the individual to a gene expression analysis, in which the gene expression analysis comprises determining expression levels of a plurality of autism spectrum disorder-associated genes in the clinical sample using an expression level determining system, in which the autism spectrum disorder-associated genes comprise at least ten genes selected from Table 4, 5, 6, 8, 9, 10, or 11; and (b) applying an autism spectrum disorder-classifier to the expression levels, in which the autism spectrum disorder-classifier characterizes the autism spectrum disorder status of the individual based on the expression levels. In some embodiments, the methods comprise diagnosing autism spectrum disorder in the individual based on the autism spectrum disorder status.
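Step (b) above, applying a classifier to measured expression levels, can be illustrated with a minimal sketch. It uses a k-nearest neighbor rule with Euclidean distance, which is only one of the algorithm families contemplated; the function name `knn_classify`, the three-gene profiles, and every numeric value are hypothetical and are not taken from the Tables.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_classify(train_profiles, train_labels, query_profile, k=3):
    """Return the majority label among the k training profiles nearest the query."""
    order = sorted(
        range(len(train_profiles)),
        key=lambda i: dist(train_profiles[i], query_profile),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical expression levels for three unnamed genes (illustrative only).
train_profiles = [
    [8.1, 7.9, 5.2],
    [8.4, 8.2, 5.0],
    [7.9, 8.0, 5.4],
    [6.0, 6.1, 6.8],
    [6.2, 5.9, 7.0],
    [5.8, 6.3, 6.9],
]
train_labels = ["ASD", "ASD", "ASD", "control", "control", "control"]

status_a = knn_classify(train_profiles, train_labels, [8.0, 8.1, 5.1])
status_b = knn_classify(train_profiles, train_labels, [6.1, 6.0, 6.9])
```

With an odd k and two classes, the vote cannot tie; a real classifier would be trained and validated on clinical data rather than on a toy set like this one.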
In certain embodiments, the autism spectrum disorder-classifier is based on an algorithm selected from logistic regression, partial least squares, linear discriminant analysis, quadratic discriminant analysis, neural network, naïve Bayes, C4.5 decision tree, k-nearest neighbor, random forest, and support vector machine. In certain embodiments, the autism spectrum disorder-classifier has an accuracy of at least 65%. In certain embodiments, the autism spectrum disorder-classifier has an accuracy in a range of about 65% to about 90%. In certain embodiments, the autism spectrum disorder-classifier has a sensitivity of at least 65%. In certain embodiments, the autism spectrum disorder-classifier has a sensitivity in a range of about 65% to about 95%. In certain embodiments, the autism spectrum disorder-classifier has a specificity of at least 65%. In certain embodiments, the autism spectrum disorder-classifier has a specificity in a range of about 65% to about 85%.
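For reference, accuracy, sensitivity, and specificity can be computed from known and predicted statuses using their standard definitions. The sketch below is illustrative only; the function name `classifier_metrics` and the ten example labels are hypothetical.

```python
def classifier_metrics(true_labels, predicted_labels, positive="ASD"):
    """Compute accuracy, sensitivity, and specificity for a two-class problem.

    Assumes at least one positive and one negative individual, so that
    neither denominator is zero.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),   # fraction classified correctly
        "sensitivity": tp / (tp + fn),        # true positive rate
        "specificity": tn / (tn + fp),        # true negative rate
    }


# Hypothetical labels: 5 individuals with ASD and 5 controls.
true_labels = ["ASD"] * 5 + ["control"] * 5
predicted_labels = ["ASD"] * 4 + ["control"] + ["control"] * 3 + ["ASD"] * 2

metrics = classifier_metrics(true_labels, predicted_labels)
```

In this toy case, 4 of 5 individuals with ASD and 3 of 5 controls are classified correctly.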
In some embodiments, the autism spectrum disorder-classifier is trained on a data set comprising expression levels of the plurality of autism spectrum disorder-associated genes in clinical samples obtained from a plurality of individuals identified as having autism spectrum disorder. In certain embodiments, the interquartile range of ages of the plurality of individuals identified as having autism spectrum disorder is from about 2 years to about 10 years. In some embodiments, the autism spectrum disorder-classifier is trained on a data set comprising expression levels of the plurality of autism spectrum disorder-associated genes in clinical samples obtained from a plurality of individuals identified as not having autism spectrum disorder. In certain embodiments, the interquartile range of ages of the plurality of individuals identified as not having autism spectrum disorder is from about 2 years to about 10 years. In some embodiments, the autism spectrum disorder-classifier is trained on a data set consisting of expression levels of the plurality of autism spectrum disorder-associated genes in clinical samples obtained from a plurality of male individuals. In certain embodiments, the individuals identified as having autism spectrum disorder were so identified based on DSM-IV-TR criteria.
In some embodiments, the autism spectrum disorder-associated genes comprise at least one, at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, or at least 90 genes selected from Table 4, 5, 6, 8, 9, 10 or 11. In some embodiments, the autism spectrum disorder-associated genes comprise at least one of: LRRC6, SULF2, and YES1.
In some embodiments, the autism spectrum disorder genes comprise at least one, at least two, at least three, at least four, at least five, at least six, at least seven, or at least eight genes selected from Tables 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, or 24.
In some embodiments, the autism spectrum disorder-associated gene is selected from the group consisting of: ADAM10, ARFGEF1, CAB39, COL4A3BP, CREBBP, DDX42, DNAJC3, HNRNPA2B1, IVNS1ABP, KIAA0247, KIDINS220, MGAT4A, MTMR10, MYO5A, NBEAL2, NCOA6, NUP50, PNN, PTPRE, RBL2, RNF145, ROCK1, RPS6KA3, SERINC3, SIRPA, SLA, SNRK, STK38, SULF2, TBC1D14, TMEM2, TRIP12, UTY, ZDHHC17, ZFP36L2, ZMAT1, ZNF12, and ZNF292. In some embodiments, the autism spectrum disorder-associated gene is selected from the group consisting of: AHNAK, BOD1L, CD9, CNTRL, IFNAR2, KBTBD11, KCNE3, KLHL2, MAN2A2, MAPK14, MEGF9, MIR223, PNISR, RMND5A, SSH2, ZNF516, and ZNF548.
In some embodiments, the methods involve comparing each expression level of the plurality of autism spectrum disorder-associated genes with an appropriate reference level, and the autism spectrum disorder status of the individual is determined based on the results of the comparison. In some embodiments, a higher level of at least one autism spectrum disorder-associated gene selected from: ZNF12, RBL2, ZNF292, IVNS1ABP, ZFP36L2, ARFGEF1, UTY, SLA, KIAA0247, HNRNPA2B1, RNF145, PTPRE, SFRS18, ZNF238, TRIP12, PNN, ZDHHC17, MLL3, MTMR10, STK38, SERINC3, NIPBL, TIGD1, DDX42, NUP50, CAB39, ROCK1, SULF2, FABP2, KIDINS220, NCOA6, SIRPA, PCSK5, ADAM10, ZNF33A, ZMAT1, C10orf28, MGAT4A, CEP110, ZZEF1, CREBZF, DOCK11, ATRN, COL4A3BP, FAM133A, TTC14, TMEM30A, MYO5A, KDM2A, ZCCHC14, RNF44, ZBTB44, CLTC, UTRN, ATXN7, PPP1R12A, LBR, TBC1D14, SPATA13, HK2, CREBBP, MED23, ZFYVE16, PAN3, RBBP6, AVL9, ZNF354A, ACTR2, TMBIM1, RPS6KA3, DNMBP, NBEAL2, MYSM1, TMEM2, SNRK, KIAA1109, HECA, DNAJC3, KIF5B, POLR2B, ANTXR2, VPS13C, MANBA, NIN, LRRC6, and YES1 compared with an appropriate reference level indicates that the individual has autism spectrum disorder. In some embodiments, a lower level of STXBP6 compared with an appropriate reference level indicates that the individual has autism spectrum disorder.
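The comparison against reference levels described above can be sketched as follows. The gene symbols ZNF12, RBL2, and STXBP6 come from the lists above, but the function name `compare_to_reference` and all numeric levels are hypothetical, and the sketch does not specify how an appropriate reference level would be chosen.

```python
def compare_to_reference(levels, reference, up_genes, down_genes):
    """Return (higher, lower): genes deviating from reference in the listed direction.

    Genes missing from either `levels` or `reference` are skipped.
    """
    higher = sorted(
        g for g in up_genes
        if g in levels and g in reference and levels[g] > reference[g]
    )
    lower = sorted(
        g for g in down_genes
        if g in levels and g in reference and levels[g] < reference[g]
    )
    return higher, lower


# Hypothetical measured and reference levels (arbitrary units).
levels = {"ZNF12": 9.2, "RBL2": 7.0, "STXBP6": 4.1}
reference = {"ZNF12": 8.5, "RBL2": 7.4, "STXBP6": 5.0}

higher, lower = compare_to_reference(
    levels, reference, up_genes=["ZNF12", "RBL2"], down_genes=["STXBP6"]
)
```

Here ZNF12 is above its reference level and STXBP6 is below its reference level, while RBL2 is not elevated.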
In some embodiments, the autism spectrum disorder-associated genes comprise at least one gene selected from each of at least two of the following KEGG pathways: Neurotrophin signaling pathway, Long-term potentiation, mTOR signaling pathway, Progesterone-mediated oocyte maturation, Regulation of actin cytoskeleton, Fc gamma R-mediated phagocytosis, Renal cell carcinoma, Chemokine signaling pathway, Type II diabetes mellitus, Non-small cell lung cancer, Colorectal cancer, ErbB signaling pathway, Prostate cancer, and Glioma. In some embodiments, the autism spectrum disorder-associated genes comprise at least one gene selected from each of the foregoing KEGG pathways.
In some embodiments, the autism spectrum disorder-associated genes comprise at least two different genes selected from at least two of the following sets: (i) MAPK1, RPS6KA3, YWHAG, CRKL, MAP2K1, PIK3CB, PIK3CD, SH2B3, MAPK8, KIDINS220; (ii) MAPK1, RPS6KA3, GNAQ, MAP2K1, CREBBP, PPP3CB, PPP1R12A; (iii) MAPK1, RPS6KA3, PIK3CB, PIK3CD, CAB39, RICTOR; (iv) IGF1R, MAPK1, RPS6KA3, MAP2K1, PIK3CB, PIK3CD, MAPK8; (v) GNA13, MAPK1, CRKL, ROCK1, MAP2K1, PIK3CB, PIK3CD, SSH2, PPP1R12A, IQGAP2, ITGB2; (vi) MAPK1, PTPRC, DOCK2, CRKL, MAP2K1, PIK3CB, PIK3CD; (vii) MAPK1, CRKL, MAP2K1, PIK3CB, PIK3CD, CREBBP; (viii) MAPK1, DOCK2, CRKL, ROCK1, MAP2K1, PIK3CB, PREX1, PIK3CD, CCR2, CCR10; (ix) MAPK1, PIK3CB, PIK3CD, HK2, MAPK8; (x) MAPK1, RASSF5, MAP2K1, PIK3CB, PIK3CD; (xi) IGF1R, MAPK1, MAP2K1, PIK3CB, PIK3CD, MAPK8; (xii) MAPK1, CRKL, MAP2K1, PIK3CB, PIK3CD, MAPK8; (xiii) IGF1R, MAPK1, MAP2K1, PIK3CB, PIK3CD, CREBBP; and (xiv) IGF1R, MAPK1, MAP2K1, PIK3CB, PIK3CD.
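Whether a candidate gene panel draws on at least two of the foregoing sets can be checked mechanically. The sketch below uses only sets (i) through (iii) for brevity and assumes one possible reading of the condition: the panel contains at least two different listed genes and overlaps at least two sets. The function names are hypothetical.

```python
GENE_SETS = {
    "i": {"MAPK1", "RPS6KA3", "YWHAG", "CRKL", "MAP2K1", "PIK3CB",
          "PIK3CD", "SH2B3", "MAPK8", "KIDINS220"},
    "ii": {"MAPK1", "RPS6KA3", "GNAQ", "MAP2K1", "CREBBP", "PPP3CB",
           "PPP1R12A"},
    "iii": {"MAPK1", "RPS6KA3", "PIK3CB", "PIK3CD", "CAB39", "RICTOR"},
}


def sets_covered(panel, gene_sets):
    """Return the names of the sets that share at least one gene with the panel."""
    panel = set(panel)
    return sorted(name for name, genes in gene_sets.items() if genes & panel)


def panel_qualifies(panel, gene_sets, min_genes=2, min_sets=2):
    """Assumed reading: >= min_genes distinct listed genes spanning >= min_sets sets."""
    listed = set(panel) & set().union(*gene_sets.values())
    return len(listed) >= min_genes and len(sets_covered(panel, gene_sets)) >= min_sets


covered = sets_covered(["YWHAG", "GNAQ"], GENE_SETS)
ok_two_sets = panel_qualifies(["YWHAG", "GNAQ"], GENE_SETS)
ok_one_set = panel_qualifies(["CAB39", "RICTOR"], GENE_SETS)
```

Because several genes (for example MAPK1) appear in many sets, other readings of the condition are possible and would need a different check.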
In some embodiments, the autism spectrum disorder genes comprise at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least 30, at least 40, at least 50, at least 60, at least 70, or at least 80 genes selected from Table 6. In some embodiments, the autism spectrum disorder genes comprise all of the genes of Table 6.
In some embodiments, the autism spectrum disorder genes comprise at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, or at least 90 genes selected from Table 9. In some embodiments, the autism spectrum disorder genes comprise all of the genes of Table 9. In certain embodiments, the autism spectrum disorder is autistic disorder (AUT).
In some embodiments, the autism spectrum disorder genes comprise at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least 30, or at least 40 genes selected from Table 10. In certain embodiments, the autism spectrum disorder is pervasive developmental disorder—not otherwise specified (PDDNOS).
In some embodiments, the autism-spectrum disorder-associated gene is not AFF2, CD44, CNTNAP3, CREBBP, DAPK1, JMJD1C, NIPBL, PTPRC, SH3KBP1, STK39, DOCK8, RPS6KA3, or ATRX.
In some embodiments, the autism spectrum disorder genes comprise at least two, at least three, at least four, at least five, at least six, at least seven, or at least eight genes selected from Table 11. In certain embodiments, the autism spectrum disorder is Asperger's disorder (ASP).
In some embodiments, each expression level is a level of an RNA encoded by an autism spectrum disorder-associated gene. In certain embodiments, the expression level determining system comprises a hybridization-based assay for determining the level of the RNA in the clinical sample. In certain embodiments, the hybridization-based assay is an oligonucleotide array assay, an oligonucleotide conjugated bead assay, a molecular inversion probe assay, a serial analysis of gene expression (SAGE) assay, or an RT-PCR assay.
In some embodiments, each expression level is a level of a protein encoded by an autism spectrum disorder-associated gene. In certain embodiments, the expression level determining system comprises an antibody-based assay for determining the level of the protein in the clinical sample. In certain embodiments, the antibody-based assay is an antibody array assay, an antibody conjugated-bead assay, an enzyme-linked immuno-sorbent (ELISA) assay, or an immunoblot assay.
In some embodiments, the expression levels of autism spectrum disorder-associated genes used in the methods comprise a combination of protein levels and RNA levels.
According to some aspects of the invention, arrays are provided that comprise, or consist essentially of, oligonucleotide probes that hybridize to nucleic acids having sequence correspondence to mRNAs of at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, or at least 90 genes selected from autism spectrum disorder-associated genes selected from Table 4, 5, 6, 8, 9, 10, or 11.
According to some aspects of the invention, arrays are provided that comprise, or consist essentially of, antibodies that bind specifically to proteins encoded by at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, or at least 90 genes selected from autism spectrum disorder-associated genes selected from Table 4, 5, 6, 8, 9, 10, or 11.
According to some aspects of the invention, methods are provided for monitoring progression of an autism spectrum disorder in an individual in need thereof. In some embodiments, the methods involve (a) obtaining a clinical sample from the individual; (b) determining expression levels of a plurality of autism spectrum disorder-associated genes in the clinical sample using an expression level determining system; and (c) comparing each expression level determined in (b) with an appropriate reference level, in which the results of the comparison are indicative of the extent of progression of the autism spectrum disorder in the individual.
In some embodiments, the monitoring methods involve (a) obtaining a first clinical sample from the individual, (b) determining expression levels of a plurality of autism spectrum disorder-associated genes in the first clinical sample using an expression level determining system, (c) obtaining a second clinical sample from the individual, (d) determining expression levels of the plurality of autism spectrum disorder-associated genes in the second clinical sample using an expression level determining system, and (e) comparing the expression level of each autism spectrum disorder-associated gene determined in (b) with the expression level determined in (d) of the same autism spectrum disorder-associated gene, in which the results of comparing in (e) are indicative of the extent of progression of the autism spectrum disorder in the individual.
In some embodiments, the monitoring methods involve (a) obtaining a first clinical sample from the individual, (b) obtaining a second clinical sample from the individual, (c) determining the expression level of an autism spectrum disorder-associated gene in the first clinical sample using an expression level determining system, (d) determining the expression level of the autism spectrum disorder-associated gene in the second clinical sample using an expression level determining system, (e) comparing the expression level determined in (c) with the expression level determined in (d), and (f) performing (c)-(e) for at least one other autism spectrum disorder-associated gene, in which the results of comparing in (e) for the at least two autism spectrum disorder-associated genes are indicative of the extent of progression of the autism spectrum disorder in the individual.
In some embodiments, the monitoring methods involve (a) obtaining a first clinical sample from the individual, (b) obtaining a second clinical sample from the individual, (c) determining a first expression pattern comprising expression levels of at least two autism spectrum disorder-associated genes in the first clinical sample using an expression level determining system, (d) determining a second expression pattern comprising expression levels of at least two autism spectrum disorder-associated genes in the second clinical sample using an expression level determining system, (e) comparing the first expression pattern with the second expression pattern, in which the results of comparing in (e) are indicative of the extent of progression of the autism spectrum disorder in the individual.
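The comparison of a first and a second expression pattern in step (e) can be sketched as a per-gene difference. The gene symbols SULF2 and YES1 are taken from the lists above; the function name `pattern_change` and the numeric levels are hypothetical, and the sketch does not say how a given change maps to an extent of progression.

```python
def pattern_change(first_pattern, second_pattern):
    """Return second-minus-first expression change for genes present in both patterns."""
    shared = sorted(set(first_pattern) & set(second_pattern))
    return {gene: second_pattern[gene] - first_pattern[gene] for gene in shared}


# Hypothetical expression levels at two sampling times (arbitrary units).
first_pattern = {"SULF2": 7.0, "YES1": 6.75}
second_pattern = {"SULF2": 7.5, "YES1": 6.25}

change = pattern_change(first_pattern, second_pattern)
```

A positive value indicates that the level rose between the two samples, and a negative value indicates that it fell.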
In some embodiments of the monitoring methods, the time between obtaining the first clinical sample and obtaining the second clinical sample is a time sufficient for a change in the severity of the autism spectrum disorder to occur in the individual. In some embodiments of the monitoring methods, in the time between obtaining the first clinical sample and obtaining the second clinical sample the individual is treated for the autism spectrum associated disorder. In some embodiments, the time between obtaining the first clinical sample and obtaining the second clinical sample is up to about one week, about one month, about six months, about one year, about two years, about three years, or more. In some embodiments, the time between obtaining the first clinical sample and obtaining the second clinical sample is in a range of one week to one month, one month to six months, one month to one year, six months to one year, six months to two years, one year to three years, or one year to five years.
According to some aspects of the invention, methods are provided for assessing the efficacy of a treatment for an autism spectrum disorder in an individual in need thereof. In some embodiments, the methods involve: (a) obtaining a clinical sample from the individual, (b) administering a treatment to the individual for the autism spectrum disorder, (c) determining an expression pattern comprising expression levels of at least two autism spectrum disorder-associated genes in the clinical sample, and (d) comparing the expression pattern with an appropriate reference expression pattern, in which the appropriate reference expression pattern comprises expression levels of the at least two autism spectrum disorder-associated genes in a clinical sample obtained from an individual who does not have the autism spectrum disorder, in which the results of the comparison in (d) are indicative of the efficacy of the treatment.
In some embodiments, the methods for assessing efficacy of a treatment for an autism spectrum disorder involve (a) obtaining a first clinical sample from the individual, (b) administering a treatment to the individual for the autism spectrum disorder, (c) obtaining a second clinical sample from the individual after having administered the treatment to the individual, (d) determining a first expression pattern comprising expression levels of at least two autism spectrum disorder-associated genes in the first clinical sample, (e) comparing the first expression pattern with an appropriate reference expression pattern, in which the appropriate reference expression pattern comprises expression levels of the at least two autism spectrum disorder-associated genes in a clinical sample obtained from an individual who does not have the autism spectrum disorder, (f) determining a second expression pattern comprising expression levels of at least two autism spectrum disorder-associated genes in the second clinical sample, and (g) comparing the second expression pattern with the appropriate reference expression pattern, in which a difference between the second expression pattern and the appropriate reference expression pattern that is less than the difference between the first expression pattern and the appropriate reference pattern is indicative of the treatment being effective.
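The criterion in (g), that the second pattern differs from the reference pattern by less than the first pattern does, can be sketched as below. Euclidean distance is used as the measure of difference purely as an assumption, since no particular measure is prescribed here; the function names and numeric levels are hypothetical.

```python
from math import sqrt


def pattern_distance(pattern, reference):
    """Euclidean distance between two patterns over the genes they share (assumed metric)."""
    genes = sorted(set(pattern) & set(reference))
    return sqrt(sum((pattern[g] - reference[g]) ** 2 for g in genes))


def treatment_effective(first_pattern, second_pattern, reference):
    """True if the post-treatment pattern is closer to the reference than the first pattern."""
    return pattern_distance(second_pattern, reference) < pattern_distance(
        first_pattern, reference
    )


# Hypothetical levels (arbitrary units) for two genes named in the lists above.
reference = {"SULF2": 6.0, "YES1": 5.0}
first_pattern = {"SULF2": 9.0, "YES1": 9.0}    # first clinical sample
second_pattern = {"SULF2": 6.0, "YES1": 6.0}   # sample obtained after treatment

before = pattern_distance(first_pattern, reference)
after = pattern_distance(second_pattern, reference)
effective = treatment_effective(first_pattern, second_pattern, reference)
```

Other difference measures, such as a correlation-based distance, could be substituted without changing the structure of the comparison.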
According to some aspects of the invention, methods are provided for selecting an appropriate dosage of a treatment for an autism spectrum associated disorder in an individual in need thereof. In some embodiments, the methods involve (a) administering a first dosage of a treatment for an autism spectrum associated disorder to the individual, (b) assessing the efficacy of the first dosage of the treatment, in part, by determining at least one expression pattern comprising expression levels of at least two autism spectrum disorder-associated genes in a clinical sample obtained from the individual, (c) administering a second dosage of a treatment for an autism spectrum associated disorder in the individual, (d) assessing the efficacy of the second dosage of the treatment, in part, by determining at least one expression pattern comprising expression levels of at least two autism spectrum disorder-associated genes in a clinical sample obtained from the individual, in which the appropriate dosage is selected as the dosage administered in (a) or (c) that has the greatest efficacy.
According to some aspects of the invention, methods are provided for selecting an appropriate dosage of a treatment for an autism spectrum associated disorder in an individual in need thereof. In some embodiments, the methods involve (a) administering a dosage of a treatment for an autism spectrum associated disorder to the individual; (b) assessing the efficacy of the dosage of the treatment, in part, by determining at least one expression pattern comprising expression levels of at least two autism spectrum disorder-associated genes in a clinical sample obtained from the individual, and (c) selecting the dosage as being appropriate for the treatment for the autism spectrum associated disorder in the individual, if the efficacy determined in (b) is at or above a threshold level, in which the threshold level is an efficacy level at or above which a treatment substantially improves at least one symptom of an autism spectrum disorder.
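The two dosage-selection approaches described above, choosing the dosage with the greatest efficacy or accepting a dosage whose efficacy meets a threshold, can be sketched as follows. The efficacy scores, dosage labels, threshold value, and function names are all hypothetical; how an efficacy score is derived from an expression pattern is not specified by this sketch.

```python
def select_best_dosage(efficacy_by_dosage):
    """Return the dosage label with the greatest efficacy score."""
    return max(efficacy_by_dosage, key=efficacy_by_dosage.get)


def dosage_is_appropriate(efficacy, threshold):
    """True if the efficacy is at or above the threshold level."""
    return efficacy >= threshold


# Hypothetical efficacy scores for a first and a second dosage.
efficacy_by_dosage = {"first_dosage": 0.35, "second_dosage": 0.60}

best = select_best_dosage(efficacy_by_dosage)
appropriate = dosage_is_appropriate(efficacy_by_dosage[best], threshold=0.5)
```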
According to some aspects of the invention, methods are provided for identifying an agent useful for treating an autism spectrum associated disorder in an individual in need thereof. In some embodiments, the methods involve (a) contacting an autism spectrum disorder-associated cell with a test agent, (b) determining at least one expression pattern comprising expression levels of at least two autism spectrum disorder-associated genes in the autism spectrum disorder-associated cell, (c) comparing the at least one expression pattern with a test expression pattern, and (d) identifying the agent as being useful for treating the autism spectrum associated disorder based on the comparison in (c). In some embodiments, the test expression pattern is an expression pattern indicative of an individual who does not have the autism spectrum disorder, and a decrease in a difference between the at least one expression pattern and the test expression pattern resulting from contacting the autism spectrum disorder-associated cell with the test agent identifies the test agent as being useful for the treatment of the autism spectrum associated disorder. In some embodiments, the autism spectrum disorder-associated cell is contacted with the test agent in (a) in vivo. In some embodiments, the autism spectrum disorder-associated cell is contacted with the test agent in (a) in vitro.
Autism Spectrum Disorder (ASD) is a highly heritable neurodevelopmental disorder. Applicants have developed robust profiling methods that classify the ASD status in individuals. In some embodiments, Applicants have developed methods that are useful for classifying the ASD status in males. In other embodiments, Applicants have developed methods that are useful for classifying the ASD status in individuals of particular age groups. In some embodiments, a gene expression based classifier is provided that achieves clinically relevant classification accuracies of ASD status. In other embodiments, gene expression based classifiers are provided that discriminate among autistic disorder (AUT), pervasive developmental disorder—not otherwise specified (PDDNOS), and Asperger's disorder (ASP). In some embodiments, the profiling methods are useful for diagnosing individuals as having ASD. In some embodiments, the profiling methods are also useful for selecting, or aiding in selecting, a treatment for an individual who has ASD or who is suspected of having ASD.
The term “autism spectrum disorder” (which may also be referred to herein by the acronym, “ASD”) refers to a spectrum of neuropsychological conditions that cause severe and pervasive impairment in thinking, feeling, language, and the ability to relate to others. Individuals with autism spectrum disorder may have restricted and/or repetitive behaviors or interests. Autism spectrum disorder may be first suspected or diagnosed in early childhood and may range in severity from a severe form, called autistic disorder, or autism, through pervasive developmental disorder not otherwise specified (PDD-NOS), to a milder form, Asperger syndrome. Autism spectrum disorder may also include two rare disorders, Rett syndrome and childhood disintegrative disorder. As used herein, the phrase “diagnosing autism spectrum disorder” refers to diagnosing, or aiding in diagnosing, an individual as having autism spectrum disorder.
As described herein, a variety of genes are differentially expressed in individuals having autism spectrum disorder compared with individuals identified as not having autism spectrum disorder. An “autism spectrum disorder-associated gene” is a gene whose expression levels are associated with autism spectrum disorder. Examples of autism spectrum disorder-associated genes include, but are not limited to, the genes listed in Table 4, 5, 6, 8, 9, 10 or 11. In some embodiments, the autism spectrum disorder associated gene is a gene of Table 4. Further examples of autism spectrum disorder genes are provided in Tables 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, and 24.
As used herein, the term “autism spectrum disorder-associated cell” refers to a cell that expresses one or more autism spectrum disorder-associated genes. In some embodiments, an autism spectrum disorder-associated cell expresses at least two autism spectrum disorder associated genes. In some embodiments, an autism spectrum disorder-associated cell is a cell, obtained from an individual, that expresses autism spectrum disorder associated genes, the expression levels of which genes are useful for diagnosing or assessing the status of autism spectrum disorder in the individual. As used herein, the term “autism spectrum disorder-associated tissue” is a tissue comprising an autism spectrum disorder-associated cell.
The term “individual”, as used herein, refers to any mammal, including humans and non-humans, such as non-human primates. Typically, an individual is a human. An individual may be of any appropriate age for the methods disclosed herein. For example, methods disclosed herein may be used to characterize the autism spectrum disorder status of a child, e.g., a human in a range of about 1 to about 12 years old. An individual may be a non-human that serves as an animal model of autism spectrum disorder. An individual may alternatively be referred to herein synonymously as a subject.
Methods are provided herein for characterizing the autism spectrum disorder status of an individual in need thereof. An individual in need of a characterization of autism spectrum disorder status is any individual at risk of, or suspected of, having autism spectrum disorder. An individual's “autism spectrum disorder status” may be characterized as having autism spectrum disorder or as not having autism spectrum disorder.
An individual in need of diagnosis of autism spectrum disorder is any individual at risk of, or suspected of, having autism spectrum disorder. An individual at risk of having autism spectrum disorder may be an individual having one or more risk factors for autism spectrum disorder. Risk factors for autism spectrum disorder include, but are not limited to, a family history of autism spectrum disorder; elevated age of parents; low birth weight; premature birth; presence of a genetic disease associated with autism; and sex (males are more likely to have autism than females). Other risk factors will be apparent to the skilled artisan. An individual suspected of having autism spectrum disorder may be an individual having one or more clinical symptoms of autism spectrum disorder. A variety of clinical symptoms of autism spectrum disorder are known in the art. Examples of such symptoms include, but are not limited to, no babbling by 12 months; no gesturing (pointing, waving goodbye, etc.) by 12 months; no single words by 16 months; no two-word spontaneous phrases (other than instances of echolalia) by 24 months; and any loss of language or social skills at any age.
The methods disclosed herein may be used in combination with any one of a number of standard diagnostic approaches, including, but not limited to, clinical or psychological observations and/or ASD-related screening modalities, such as, for example, the Modified Checklist for Autism in Toddlers (M-CHAT), the Early Screening of Autistic Traits Questionnaire, and the First Year Inventory to facilitate or aid in the diagnosis of ASD. In some embodiments, methods disclosed herein are used to identify subgroups of ASD.
The methods disclosed herein typically involve determining expression levels of at least one autism spectrum disorder-associated gene in a clinical sample obtained from an individual. The methods may involve determining expression levels of at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 300, or more autism spectrum disorder-associated genes in a clinical sample obtained from an individual. The methods may involve determining expression levels of 1 to 10, 10 to 20, 20 to 30, 30 to 40, 40 to 50, 50 to 60, 60 to 70, 70 to 80, 80 to 90, 90 to 100, 100 to 200, 200 to 300, or 300 to 400 autism spectrum disorder-associated genes in a clinical sample obtained from an individual. The methods may involve determining expression levels of about 10, about 20, about 30, about 35, about 40, about 50, about 60, about 70, about 80, about 85, about 90, about 100, or more autism spectrum disorder-associated genes in a clinical sample obtained from an individual.
An expression level determining system may be used in the methods. The term “expression level determining system”, as used herein, refers to a set of components, equipment, and/or reagents, for determining the expression level of a gene in a sample. The expression level of an autism spectrum disorder-associated gene may be determined as the level of an RNA encoded by the gene, in which case, the expression level determining system may comprise components useful for determining levels of nucleic acids. The expression level determining system may comprise, for example, hybridization-based assay components, and related equipment and reagents, for determining the level of the RNA in the clinical sample. Hybridization-based assays are well known in the art and include, but are not limited to, oligonucleotide array assays (e.g., microarray assays), cDNA array assays, oligonucleotide conjugated bead assays (e.g., Multiplex Bead-based Luminex® Assays), molecular inversion probe assays, serial analysis of gene expression (SAGE) assays, RNase protection assays, northern blot assays, in situ hybridization assays, and RT-PCR assays. Multiplex systems, such as oligonucleotide arrays or bead-based nucleic acid assay systems, are particularly useful for evaluating levels of a plurality of nucleic acids simultaneously. RNA-Seq (mRNA sequencing using Ultra High throughput or Next Generation Sequencing) may also be used to determine expression levels. Other appropriate methods for determining levels of nucleic acids will be apparent to the skilled artisan.
The expression level of an autism spectrum disorder-associated gene may be determined as the level of a protein encoded by the gene, in which case, the expression level determining system may comprise components useful for determining levels of proteins. The expression level determining system may comprise, for example, antibody-based assay components, and related equipment and reagents, for determining the level of the protein in the clinical sample. Antibody-based assays are well known in the art and include, but are not limited to, antibody array assays, antibody conjugated-bead assays, enzyme-linked immuno-sorbent (ELISA) assays, immunofluorescence microscopy assays, and immunoblot assays. Other methods for determining protein levels include mass spectrometry, spectrophotometry, and enzymatic assays. Still other appropriate methods for determining levels of proteins will be apparent to the skilled artisan.
As used herein, a “level” refers to a value indicative of the amount or occurrence of a molecule, e.g., a protein, a nucleic acid, e.g., RNA. A level may be an absolute value, e.g., a quantity of a molecule in a sample, or a relative value, e.g., a quantity of a molecule in a sample relative to the quantity of the molecule in a reference sample (control sample). The level may also be a binary value indicating the presence or absence of a molecule. For example, a molecule may be identified as being present in a sample when a measurement of the quantity of the molecule in the sample, e.g., a fluorescence measurement from a PCR reaction or microarray, exceeds a background value. Similarly, a molecule may be identified as being absent from a sample (or undetectable in a sample) when a measurement of the quantity of the molecule in the sample is at or below background value.
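The three kinds of level described above (absolute, relative, and binary) can be illustrated with a minimal Python sketch. The function names and the example numbers are hypothetical and are not part of the disclosed methods; the sketch only assumes that a quantity and a background value are available as plain numbers.

```python
def relative_level(sample_quantity: float, reference_quantity: float) -> float:
    """Relative level: quantity in the sample divided by the quantity
    in a reference (control) sample."""
    if reference_quantity <= 0:
        raise ValueError("reference quantity must be positive")
    return sample_quantity / reference_quantity


def is_present(measurement: float, background: float) -> bool:
    """Binary level: the molecule is called present only when the
    measurement exceeds the background value; a measurement at or
    below background is called absent (undetectable)."""
    return measurement > background


# Hypothetical fluorescence readings, for illustration only.
print(relative_level(30.0, 10.0))   # sample relative to control
print(is_present(120.0, 100.0))     # above background
print(is_present(100.0, 100.0))     # at background, so absent
```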
The methods may involve obtaining a clinical sample from the individual. As used herein, the phrase “obtaining a clinical sample” refers to any process for directly or indirectly acquiring a clinical sample from an individual. For example, a clinical sample may be obtained (e.g., at a point-of-care facility, e.g., a physician's office, a hospital) by procuring a tissue or fluid sample (e.g., blood draw, spinal tap) from an individual. Alternatively, a clinical sample may be obtained by receiving the clinical sample (e.g., at a laboratory facility) from one or more persons who procured the sample directly from the individual.
The term “clinical sample” refers to a sample derived from an individual, e.g., a patient. Clinical samples include, but are not limited to, tissue (e.g., brain tissue), cerebrospinal fluid, blood, blood fractions (e.g., serum, plasma), sputum, fine needle biopsy samples, urine, peritoneal fluid, and pleural fluid, or cells therefrom (e.g., blood cells (e.g., white blood cells, red blood cells)). Accordingly, a clinical sample may comprise a tissue, cell or biomolecule (e.g., RNA, protein). In some embodiments, the clinical sample is a sample of peripheral blood, brain tissue, or spinal fluid.
It is to be understood that a clinical sample may be processed in any appropriate manner to facilitate determining expression levels of autism spectrum disorder-associated genes. For example, biochemical, mechanical and/or thermal processing methods may be appropriately used to isolate a biomolecule of interest, e.g., RNA, protein, from a clinical sample. An RNA sample may be isolated from a clinical sample by processing the clinical sample using methods well known in the art, and levels of an RNA encoded by an autism spectrum disorder-associated gene may be determined in the RNA sample. A protein sample may be isolated from a clinical sample by processing the clinical sample using methods well known in the art, and levels of a protein encoded by an autism spectrum disorder-associated gene may be determined in the protein sample. The expression levels of autism spectrum disorder-associated genes may also be determined in a clinical sample directly.
The methods disclosed herein also typically comprise comparing expression levels of autism spectrum disorder-associated genes with an appropriate reference level. An “appropriate reference level” is an expression level of a particular autism spectrum disorder-associated gene that is indicative of a known autism spectrum disorder status. An appropriate reference level can be determined or can be a pre-existing reference level. An appropriate reference level may be an expression level indicative of autism spectrum disorder. For example, an appropriate reference level may be representative of the expression level of an autism spectrum disorder-associated gene in a clinical sample obtained from an individual known to have autism spectrum disorder. When an appropriate reference level is indicative of autism spectrum disorder, a lack of a significant difference between an expression level determined from an individual in need of characterization or diagnosis of autism spectrum disorder and the appropriate reference level may be indicative of autism spectrum disorder in the individual. Alternatively, when an appropriate reference level is indicative of autism spectrum disorder, a significant difference between an expression level determined from an individual in need of characterization or diagnosis of autism spectrum disorder and the appropriate reference level may be indicative of the individual being free of autism spectrum disorder.
An appropriate reference level may be a threshold level such that an expression level being above or below the threshold level is indicative of autism spectrum disorder in an individual.
An appropriate reference level may be an expression level indicative of an individual being free of autism spectrum disorder. For example, an appropriate reference level may be representative of the expression level of a particular autism spectrum disorder-associated gene in a clinical sample obtained from an individual who does not have autism spectrum disorder. When an appropriate reference level is indicative of an individual who does not have autism spectrum disorder, a significant difference between an expression level determined from an individual in need of diagnosis of autism spectrum disorder and the appropriate reference level may be indicative of autism spectrum disorder in the individual. Alternatively, when an appropriate reference level is indicative of the individual being free of autism spectrum disorder, a lack of a significant difference between an expression level determined from an individual in need of diagnosis of autism spectrum disorder and the appropriate reference level may be indicative of the individual being free of autism spectrum disorder.
For example, when a higher level, relative to an appropriate reference level that is indicative of an individual who does not have autism spectrum disorder, of at least one autism spectrum disorder-associated gene, which is selected from: ZNF12, RBL2, ZNF292, IVNS1ABP, ZFP36L2, ARFGEF1, UTY, SLA, KIAA0247, HNRNPA2B1, RNF145, PTPRE, SFRS18, ZNF238, TRIP12, PNN, ZDHHC17, MLL3, MTMR10, STK38, SERINC3, NIPBL, TIGD1, DDX42, NUP50, CAB39, ROCK1, SULF2, FABP2, KIDINS220, NCOA6, SIRPA, PCSK5, ADAM10, ZNF33A, ZMAT1, C10orf28, MGAT4A, CEP110, ZZEF1, CREBZF, DOCK11, ATRN, COL4A3BP, FAM133A, TTC14, TMEM30A, MYO5A, KDM2A, ZCCHC14, RNF44, ZBTB44, CLTC, UTRN, ATXN7, PPP1R12A, LBR, TBC1D14, SPATA13, HK2, CREBBP, MED23, ZFYVE16, PAN3, RBBP6, AVL9, ZNF354A, ACTR2, TMBIM1, RPS6KA3, DNMBP, NBEAL2, MYSM1, TMEM2, SNRK, KIAA1109, HECA, DNAJC3, KIF5B, POLR2B, ANTXR2, VPS13C, MANBA, and NIN, is identified, the individual's autism spectrum disorder status may be characterized as having autism spectrum disorder. When a lower level, relative to an appropriate reference level that is indicative of an individual who does not have autism spectrum disorder, of at least one autism spectrum disorder-associated gene, which includes STXBP6, is identified, the individual's autism spectrum disorder status may be characterized as having autism spectrum disorder.
The magnitude of difference between an expression level and an appropriate reference level may vary. For example, a significant difference that indicates an autism spectrum disorder status or diagnosis may be detected when the expression level of an autism spectrum disorder-associated gene in a clinical sample is at least 1%, at least 5%, at least 10%, at least 25%, at least 50%, at least 100%, at least 250%, at least 500%, or at least 1000% higher, or lower, than an appropriate reference level of that gene. Similarly, a significant difference may be detected when the expression level of an autism spectrum disorder-associated gene in a clinical sample is at least 2-fold, at least 3-fold, at least 4-fold, at least 5-fold, at least 6-fold, at least 7-fold, at least 8-fold, at least 9-fold, at least 10-fold, at least 20-fold, at least 30-fold, at least 40-fold, at least 50-fold, at least 100-fold, or more higher, or lower, than the appropriate reference level of that gene. Significant differences may be identified by using an appropriate statistical test. Tests for statistical significance are well known in the art and are exemplified in Applied Statistics for Engineers and Scientists by Petruccelli, Chen and Nandram 1999 Reprint Ed.
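As an illustration of the percentage and fold-difference comparisons described above, the following Python sketch computes both quantities for a single gene. The function names and the 2-fold default cutoff are assumptions chosen for the example; in practice the cutoff and any accompanying statistical test would be selected as appropriate.

```python
from typing import Optional


def percent_difference(level: float, reference: float) -> float:
    """Signed percent difference of an expression level relative to
    the reference level (positive means higher than the reference)."""
    if reference <= 0:
        raise ValueError("reference level must be positive")
    return 100.0 * (level - reference) / reference


def direction_of_difference(level: float, reference: float,
                            min_fold: float = 2.0) -> Optional[str]:
    """Return "higher" or "lower" when the level differs from the
    reference by at least `min_fold`, otherwise None."""
    if level <= 0 or reference <= 0:
        raise ValueError("levels must be positive")
    ratio = level / reference
    if ratio >= min_fold:
        return "higher"
    if ratio <= 1.0 / min_fold:
        return "lower"
    return None


# Hypothetical expression values, for illustration only.
print(percent_difference(150.0, 100.0))
print(direction_of_difference(400.0, 100.0))
print(direction_of_difference(120.0, 100.0))
```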
It is to be understood that a plurality of expression levels may be compared with a plurality of appropriate reference levels, e.g., on a gene-by-gene basis or as a vector difference, in order to assess the autism spectrum disorder status of the individual. In such cases, multivariate tests, e.g., Hotelling's T² test, may be used to evaluate the significance of observed differences. Such multivariate tests are well known in the art and are exemplified in Applied Multivariate Statistical Analysis by Richard Arnold Johnson and Dean W. Wichern, Prentice Hall; 4th edition (Jul. 13, 1998).
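A minimal sketch of a two-sample Hotelling's T² statistic is given below in plain Python. It assumes two groups of samples measured on the same genes, a pooled covariance matrix, and more samples than genes; the helper names are hypothetical. The sketch returns the T² statistic, the equivalent F statistic, and its degrees of freedom; converting the F statistic into a p-value requires an F-distribution function from a statistics library and is intentionally left out.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter_matrix(rows, mean):
    """Sum of outer products of deviations from the mean."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def hotelling_t2(group_a, group_b):
    """Return (T2, F, (df1, df2)) for two groups of expression vectors."""
    n1, n2, p = len(group_a), len(group_b), len(group_a[0])
    m1, m2 = mean_vector(group_a), mean_vector(group_b)
    s1, s2 = scatter_matrix(group_a, m1), scatter_matrix(group_b, m2)
    pooled = [[(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(p)]
              for i in range(p)]
    d = [a - b for a, b in zip(m1, m2)]
    w = solve(pooled, d)  # pooled^-1 * d without forming the inverse
    t2 = (n1 * n2) / (n1 + n2) * sum(di * wi for di, wi in zip(d, w))
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f_stat, (p, n1 + n2 - p - 1)
```

With a single gene, the T² statistic reduces to the square of the pooled two-sample t statistic, which offers a simple check of the implementation.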
The methods may also involve comparing a set of expression levels (referred to as an expression pattern) of autism spectrum disorder-associated genes in a clinical sample obtained from an individual with a plurality of sets of reference levels (referred to as reference patterns), each reference pattern being associated with a known autism spectrum disorder status; identifying the reference pattern that most closely resembles the expression pattern; and associating the known autism spectrum disorder status of the reference pattern with the expression pattern, thereby classifying (characterizing) the autism spectrum disorder status of the individual.
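The pattern-matching approach described above can be sketched as follows. This example assumes that resemblance is measured by the Pearson correlation between the expression pattern and each reference pattern; other similarity measures (e.g., Euclidean distance) could equally be used. The function names and the example values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def classify_by_pattern(expression_pattern, reference_patterns):
    """Return the status whose reference pattern most closely resembles
    the expression pattern, together with the correlation achieved.

    reference_patterns maps a known status to a list of reference levels
    given in the same gene order as expression_pattern.
    """
    scores = {status: pearson(expression_pattern, pattern)
              for status, pattern in reference_patterns.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical four-gene patterns, for illustration only.
references = {
    "ASD": [2.0, 1.0, 3.0, 0.5],
    "control": [1.0, 2.0, 1.0, 2.0],
}
status, score = classify_by_pattern([1.9, 1.1, 2.8, 0.7], references)
print(status, round(score, 3))
```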
The methods may also involve building or constructing a prediction model, which may also be referred to as a classifier or predictor, that can be used to classify the disease status of an individual. As used herein, an “autism spectrum disorder-classifier” is a prediction model that characterizes the autism spectrum disorder status of an individual based on expression levels determined in a clinical sample obtained from the individual. Typically the model is built using samples for which the classification (autism spectrum disorder status) has already been ascertained. Once the model is built, it may be applied to expression levels obtained from a clinical sample in order to classify the autism spectrum disorder status of the individual from which the clinical sample was obtained. Thus, the methods may involve applying an autism spectrum disorder-classifier to the expression levels, such that the autism spectrum disorder-classifier characterizes the autism spectrum disorder status of the individual based on the expression levels. The individual may be further diagnosed, e.g., by a health care provider, based on the characterized autism spectrum disorder status.
A variety of prediction models known in the art may be used as an autism spectrum disorder-classifier. For example, an autism spectrum disorder-classifier may be established using logistic regression, partial least squares, linear discriminant analysis, quadratic discriminant analysis, neural network, naïve Bayes, C4.5 decision tree, k-nearest neighbor, random forest, or support vector machine.
The autism spectrum disorder-classifier may be trained on a data set comprising expression levels of the plurality of autism spectrum disorder-associated genes in clinical samples obtained from a plurality of individuals identified as having autism spectrum disorder. For example, the autism spectrum disorder-classifier may be trained on a data set comprising expression levels of a plurality of autism spectrum disorder-associated genes in clinical samples obtained from a plurality of individuals identified as having autism spectrum disorder based on DSM-IV-TR criteria. The training set will typically also comprise control individuals identified as not having autism spectrum disorder, e.g., identified as not satisfying the DSM-IV-TR criteria. As will be appreciated by the skilled artisan, the population of individuals of the training data set may have a variety of characteristics by design, e.g., the characteristics of the population may depend on the characteristics of the individuals for whom diagnostic methods that use the classifier may be useful. For example, the interquartile range of ages of a population in the training data set may be from about 2 years old to about 10 years old, about 1 year old to about 20 years old, or about 1 year old to about 30 years old. The median age of a population in the training data set may be about 1 year old, 2 years old, 3 years old, 4 years old, 5 years old, 6 years old, 7 years old, 8 years old, 9 years old, 10 years old, 20 years old, 30 years old, 40 years old, or more. The population may consist of all males, all females or may consist of males and females.
A class prediction strength can also be measured to determine the degree of confidence with which the model classifies a clinical sample. The prediction strength also identifies when a sample cannot be classified: there may be instances in which a sample is tested but does not belong to, or cannot be reliably assigned to, a particular class. Such instances are handled by applying a threshold, such that a sample whose score falls above or below the determined threshold (depending on how the score is defined) is not classified (e.g., a “no call”).
Once a model is developed, the validity of the model can be tested using methods known in the art. One way to test the validity of the model is by cross-validation of the dataset. To perform cross-validation, one, or a subset, of the samples is eliminated and the model is built, as described above, without the eliminated sample, forming a “cross-validation model.” The eliminated sample is then classified according to the model, as described herein. This process is done with all the samples, or subsets, of the initial dataset and an error rate is determined. The accuracy of the model is then assessed. This model classifies samples to be tested with high accuracy for classes that are known, or classes that have been previously ascertained. Another way to validate the model is to apply the model to an independent data set, such as a new clinical sample having an unknown autism spectrum disorder status. Other appropriate validation methods will be apparent to the skilled artisan.
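One way a “no call” rule might be implemented is sketched below. The sketch assumes that the classifier outputs a probability of ASD and defines prediction strength as the distance of that probability from 0.5, rescaled to the range 0 to 1; both this definition and the 0.3 cutoff are illustrative assumptions, not values taken from this disclosure.

```python
def call_status(prob_asd: float, min_strength: float = 0.3):
    """Return (call, strength) for a predicted probability of ASD.

    strength = |2p - 1| is 0 when the model is undecided (p = 0.5) and
    1 when it is fully confident (p = 0 or p = 1). Samples whose
    strength falls below min_strength are reported as "no call".
    """
    if not 0.0 <= prob_asd <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    strength = abs(2.0 * prob_asd - 1.0)
    if strength < min_strength:
        return "no call", strength
    return ("ASD" if prob_asd > 0.5 else "control"), strength


# Hypothetical predicted probabilities, for illustration only.
for p in (0.9, 0.55, 0.1):
    print(p, call_status(p)[0])
```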
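The cross-validation procedure described above can be sketched with the simplest case, in which one sample is eliminated at a time (leave-one-out). A nearest-centroid classifier is used here only as a stand-in prediction model so that the example is self-contained; it is an assumption of this sketch, and any of the prediction models mentioned herein could be substituted.

```python
def centroid(rows):
    """Column-wise mean of a list of expression vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid_predict(train_x, train_y, sample):
    """Build class centroids from the training data and return the
    label of the centroid closest to the sample."""
    centroids = {}
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = centroid(members)
    return min(centroids, key=lambda lab: sq_dist(sample, centroids[lab]))


def leave_one_out_error(xs, ys):
    """Eliminate each sample in turn, rebuild the model without it,
    classify the eliminated sample, and return the error rate."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)


# Hypothetical two-gene expression values, for illustration only.
xs = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]]
ys = ["control"] * 3 + ["ASD"] * 3
print(leave_one_out_error(xs, ys))
```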
As will be appreciated by the skilled artisan, the strength of the model may be assessed by a variety of parameters including, but not limited to, the accuracy, sensitivity, specificity and area under the receiver operating characteristic curve. Methods for computing accuracy, sensitivity and specificity are known in the art and described herein (See, e.g., the Examples). The autism spectrum disorder-classifier may have an accuracy of at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 99%, or more. The autism spectrum disorder-classifier may have an accuracy score in a range of about 60% to 70%, 70% to 80%, 80% to 90%, or 90% to 100%. The autism spectrum disorder-classifier may have a sensitivity score of at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 99%, or more. The autism spectrum disorder-classifier may have a sensitivity score in a range of about 60% to 70%, 70% to 80%, 80% to 90%, or 90% to 100%. The autism spectrum disorder-classifier may have a specificity score of at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 99%, or more. The autism spectrum disorder-classifier may have a specificity score in a range of about 60% to 70%, 70% to 80%, 80% to 90%, or 90% to 100%.
Described herein are oligonucleotide (nucleic acid) arrays that are useful in the methods for determining levels of multiple nucleic acids simultaneously. Such arrays may be obtained or produced from commercial sources. Methods for producing nucleic acid arrays are well known in the art. For example, nucleic acid arrays may be constructed by immobilizing to a solid support large numbers of oligonucleotides, polynucleotides, or cDNAs capable of hybridizing to nucleic acids corresponding to mRNAs, or portions thereof. The skilled artisan is also referred to Chapter 22 “Nucleic Acid Arrays” of Current Protocols In Molecular Biology (Eds. Ausubel et al. John Wiley & Sons, NY, 2000), International Publication WO00/58516, U.S. Pat. No. 5,677,195 and U.S. Pat. No. 5,445,934 which provide non-limiting examples of methods relating to nucleic acid array construction and use in detection of nucleic acids of interest. In some embodiments, the nucleic acid arrays comprise, or consist essentially of, binding probes for mRNAs of at least 2, at least 5, at least 10, at least 20, at least 50, at least 100, at least 200, at least 300, or more genes selected from Table 6. Kits comprising the oligonucleotide arrays are also provided. Kits may include nucleic acid labeling reagents and instructions for determining expression levels using the arrays.
Autism Spectrum Disorder (ASD) relates to a broad spectrum of neurocognitive and social developmental delays including autistic disorder, pervasive developmental disorder not otherwise specified and Asperger's Disorder as subclassified in the Diagnostic and Statistical Manual of Mental Disorders, 4th edition, Text Revision (DSM-IV-TR). Onset of ASD may occur before 3 years of age. Reported prevalence of ASD has been increasing during the last decades, and a current estimation is 1 in 91. There are long waiting lists for evaluation at most centers with expertise. Progress has been made in adopting instruments such as the Autism Diagnostic Interview-Revised (ADI-R) and the Autism Diagnostic Observation Schedule (ADOS). In some cases, the median age at diagnosis is 5.7 years.
Early diagnosis and behavioral intervention may improve outcomes. This example provides diagnostic tests and/or biomarkers that can be used (e.g., in primary pediatric care centers) to reduce the time to accurate diagnosis. This example describes a gene expression study of ASD, and demonstrates the performance of blood expression signatures that classify children with ASD and distinguish ASD from controls. The signature may be useful for making a diagnosis, for example, after an increased index of suspicion is determined based on parent and/or pediatric assessment. Studies on an additional cohort were performed to further validate this signature.
Gene expression profiles of P1 were prepared using Affymetrix HG-U133 Plus 2.0 (U133p2) and those of P2 were profiled using Affymetrix Gene 1.0 ST (GeneST) arrays (Affymetrix, CA). Within the P1 data set, RNAs from 39 ASD and 12 control samples were isolated directly from whole blood using the RiboPure Blood Kit (Ambion). For all other blood samples, total RNA was extracted from 2.5 ml of whole venous blood using the PAXgene Blood RNA System (PreAnalytiX). Quality and quantity of these RNAs were assessed using the Nanodrop spectrophotometer (Thermo Scientific) and Bioanalyzer System (Agilent). Fragmented cRNA was hybridized to the appropriate Affymetrix array and scanned on an Affymetrix GeneChip Scanner 3000. cRNA from both affected and normal control population groups was prepared in batches consisting of a randomized assortment of the two comparison groups.
The gene expression levels were calculated using the probe logarithmic intensity error (PLIER) algorithm after normalizing the probe intensities using a quantile method. To identify differentially expressed genes in cases compared to controls, we used Welch's t-test for two-group comparisons, and one-way analysis of variance with Dunnett's post hoc tests to find significantly changed genes in AUT, PDDNOS, or ASP compared to the control group. A general linear model was used to evaluate the significance of diagnosis, gender, age, and the other covariates. P values were corrected for multiple comparisons by calculating a false discovery rate (FDR). Fisher's exact test was used for categorical data. Spearman's rank correlation coefficients were calculated to evaluate correlation between continuous phenotypic variables, such as age at blood drawing, and the expression level of each gene. The significance of correlation was determined using Fisher's r-to-z transformation. A machine learning method was used to build a prediction model using multi-gene expression profiles. Enriched biological pathways with predictor genes were found using the DAVID functional annotation system. Statistical analyses were performed using the R statistical programming language.
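Two of the statistical steps above, Welch's t-test and the false discovery rate correction, can be sketched in plain Python as follows. The analyses reported here were performed in R; this sketch is only an illustration. It computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom (obtaining the p-value additionally requires a t-distribution function from a statistics library), and it assumes the Benjamini-Hochberg procedure for the FDR, which is a common choice but is not named in the text.

```python
import math


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two groups with possibly unequal variances."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    se2 = v1 / n1 + v2 / n2
    t = (m1 - m2) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in the
    same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical values, for illustration only.
print(welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```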
Prediction analyses were performed using the following sequential steps: 1) rank order genes for predictor selection, 2) set up a cross-validation strategy in the training set, 3) select prediction algorithm and build a prediction model, 4) predict a test set, and 5) evaluate prediction performance as illustrated in
First, all genes were ranked by Welch's t-test p-values between AUT+PDDNOS vs. controls in the P1 dataset. The top N differentially expressed genes (N from 10 to 395 in steps of 5) were selected and used to build a prediction model with the P1 dataset using a repeated leave-group out cross-validation (LGOCV) strategy. For each prediction model using the top N genes, all P1 samples (N=99) were divided into a training set (80%) and a test set (20%), keeping the proportion of ASD and controls the same in each set. This step was repeated 100 times to estimate robust prediction performance. To optimize each prediction model further, an inner cross-validation approach was deployed where 80% of the samples served as an inner training set, and 20% were used as an inner test set. The inner cross-validation procedure was repeated 200 times to find optimal tuning parameters for the specific prediction algorithm used. For each prediction model with top N genes, a total of 20,000 predictions (100 repeated LGOCVs×200 inner cross-validations) were made. A partial least squares (PLS) method was used to find the best performing model.
For each sample in a test set, the model predicts the probability of being classified as ASD. Thus, the number of false positives among positive predictions changes with the threshold. Overall prediction accuracy was calculated as (the number of true positives + the number of true negatives)/N, where N was the total number of samples in a dataset. Sensitivity, specificity, positive predictive value, and negative predictive value were presented as standard measures of prediction performance, together with the area under the receiver operating characteristic curve (AUC). Sensitivity was calculated as the number of true positives divided by the sum of the number of true positives and the number of false negatives. Specificity was calculated as the number of true negatives divided by the sum of the number of true negatives and the number of false positives. The receiver operating characteristic (ROC) curve summarizes the result at different thresholds. AUC was calculated from the ROC curve as $\mathrm{AUC} = \int_0^1 \mathrm{ROC}(t)\,dt$. AUC and root mean squared errors (RMSE) were used as performance measurements to decide the number of genes for the final prediction model. RMSE was defined as in Equation 1, $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}(p_n - a_n)^2}$, where $p_n$ is the predicted probability of being ASD and $a_n$ is an integer for each class (1 being ASD and 0 being control) for the nth sample.
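The resampling scheme described above can be sketched as follows. The sketch only generates the stratified 80%/20% splits and the grid of gene counts; fitting the PLS model on each split is omitted. The group sizes (66 ASD and 33 control samples in P1) are those reported in this example, while the function names, the rounding rule used for the 80% cut, and the random seed are assumptions of the sketch.

```python
import random


def stratified_split(labels, train_fraction=0.8, rng=None):
    """Split sample indices into train and test sets, keeping the
    proportion of each class approximately the same in both sets."""
    rng = rng or random.Random()
    train, test = [], []
    for label in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


def repeated_lgocv(labels, n_repeats=100, seed=0):
    """Generate n_repeats independent stratified 80/20 splits."""
    rng = random.Random(seed)
    return [stratified_split(labels, 0.8, rng) for _ in range(n_repeats)]


# P1 composition as reported in this example: 66 ASD, 33 controls.
labels = ["ASD"] * 66 + ["control"] * 33
outer_splits = repeated_lgocv(labels, n_repeats=100)

# Candidate model sizes: top N genes for N = 10, 15, ..., 395.
gene_counts = list(range(10, 400, 5))

# 100 outer repeats, each tuned with 200 inner resamplings.
n_inner = 200
print(len(outer_splits) * n_inner)  # predictions per model size
```

The inner cross-validation can reuse `stratified_split` on each outer training set, so that tuning parameters are chosen without ever seeing the corresponding outer test set.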
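The performance measures defined above can be computed as in the following sketch. Labels are coded 1 for ASD and 0 for control, as in the text. The AUC is computed with the rank-based formulation (the fraction of ASD/control pairs in which the ASD sample receives the higher probability, counting ties as one half), which equals the area under the empirical ROC curve; the 0.5 default threshold and the function names are assumptions of the sketch.

```python
import math


def confusion_counts(probs, labels, threshold=0.5):
    """Count true/false positives and negatives at a given threshold."""
    tp = fp = tn = fn = 0
    for p, a in zip(probs, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and a == 1:
            tp += 1
        elif predicted == 1 and a == 0:
            fp += 1
        elif predicted == 0 and a == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def performance(probs, labels, threshold=0.5):
    """Accuracy, sensitivity, specificity, PPV and NPV.
    Assumes each denominator is non-zero at the chosen threshold."""
    tp, fp, tn, fn = confusion_counts(probs, labels, threshold)
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def auc(probs, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [p for p, a in zip(probs, labels) if a == 1]
    neg = [p for p, a in zip(probs, labels) if a == 0]
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in pos for y in neg)
    return wins / (len(pos) * len(neg))


def rmse(probs, labels):
    """Root mean squared error between probabilities and class codes."""
    n = len(labels)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(probs, labels)) / n)


# Hypothetical predictions, for illustration only.
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(performance(probs, labels))
print(auc(probs, labels), rmse(probs, labels))
```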
To find a relatively strong performing prediction model with the minimum description length, RMSEs of each prediction model were compared using the top N genes. The mean RMSEs improved gradually with increasing model complexities. As shown in
A total of 165 ASD and 103 control samples were run in replicates of four on the Biomark real time PCR system (Fluidigm, CA) using nanoliter reactions and the TaqMan system (Applied Biosystems, CA). Following the Biomark protocol, quantitative RT-PCR (qRT-PCR) amplifications were carried out in a 9 nanoliter reaction volume containing 2× Universal Master Mix (TaqMan), TaqMan gene expression assays, and preamplified cDNA. Pre-amplification reactions were done in a PTC-200 thermal cycler from MJ Research, per Biomark protocol. Reactions and analysis were performed using a Biomark system. The cycling program consisted of an initial cycle of 50° C. for 2 minutes and a 10 min incubation at 95° C. followed by 40 cycles of 95° C. for 15 seconds, 70° C. for 5 seconds, and 60° C. for 1 minute. Data were normalized to the housekeeping gene GAPDH, and expressed relative to control.
ASD patients were recruited. Study inclusion criteria comprised a clinical diagnosis of ASD by DSM-IV-TR criteria and an age >24 months. Patients with ASD recruited for this study underwent diagnostic assessment, using ADOS and ADI-R, as well as clinical testing including cognitive testing, language measures, medical history, height and weight, head circumference, and behavioral questionnaires. Two independently collected data sets (hereafter P1 and P2) consisted of 66 and 104 ASD individuals, respectively. Patients with known syndromic disorders such as fragile X mental retardation, tuberous sclerosis, Landau-Kleffner syndrome, and Klinefelter syndrome were not included in this study.
A total of 115 controls were enrolled concurrently. Certain control patients were identified as healthy children with idiopathic short stature, including genetic short stature and constitutional delay of growth, and were having clinical blood draws. Clinical blood draw results were evaluated to confirm they were within normal limits (those that were not were withdrawn from the study). Certain other control patients were offered enrollment during a well-child visit that involved a routine blood draw (for example, to obtain lead levels). A diagnosis of a chronic disease, mental retardation, ASD, or neurological disorder was an exclusion criterion for the control group. Complete phenotypic information is available with microarray data (Gene Expression Omnibus identifier GSE18123). Each cohort's clinical and demographic information is shown in Table 1.
There was no statistical difference in age between ASD and controls in the P1 (Welch's t-test P=0.29) or P2 cohort (P=0.73). Ages of ASD samples between the P1 and P2 populations were not different (P=0.52). Because of disease incidence discordance in males and females, with males 4 times more likely to develop the disease, and because a preliminary analysis revealed higher heterogeneity in RNA levels in females with ASD than in males, possibly due to the smaller number of females or to the sexual dimorphism in the expression of the disorder, only males were included in the P1 cohort (both ASD and control samples), which was used to build a prediction model for ASD. The performance of the predictive model was tested for both males and females in the P2 cohort (although the number of female controls was higher than that of female ASD cases; Fisher's exact test P=0.01 in P2).
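The normalization to GAPDH and expression relative to control might be computed as in the following sketch. The text does not state the formula used; the sketch assumes the commonly used comparative Ct (2^-ΔΔCt) method with approximately 100% amplification efficiency, and averages the four replicate Ct values before the calculation. The function names and Ct values are hypothetical.

```python
def mean(values):
    """Arithmetic mean of replicate Ct values."""
    return sum(values) / len(values)


def relative_expression(ct_target_sample, ct_gapdh_sample,
                        ct_target_control, ct_gapdh_control):
    """Relative expression by the comparative Ct method (assumed here).

    Each argument is a list of replicate Ct values. The target gene is
    first normalized to GAPDH within each group (delta Ct), and the
    sample is then expressed relative to the control (delta delta Ct).
    """
    delta_sample = mean(ct_target_sample) - mean(ct_gapdh_sample)
    delta_control = mean(ct_target_control) - mean(ct_gapdh_control)
    return 2.0 ** -(delta_sample - delta_control)


# Hypothetical Ct values in replicates of four, for illustration only.
fold = relative_expression(
    ct_target_sample=[24.0, 24.1, 23.9, 24.0],
    ct_gapdh_sample=[18.0, 18.0, 18.0, 18.0],
    ct_target_control=[25.0, 25.0, 25.0, 25.0],
    ct_gapdh_control=[18.0, 18.0, 18.0, 18.0],
)
print(round(fold, 3))
```

A value above 1 indicates higher expression in the sample than in the control; a lower Ct corresponds to more starting template.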
Expression studies were performed by microarray profiling using an earlier version of the Affymetrix array (U133p2) for the P1 data set and a later version (GeneST) for the P2 data set. To match the probeset identifiers from the two different platforms used in this study, a Best Match subset was used. 29,129 out of 54,613 total probesets on U133p2 were best matched to 17,984 unique probesets of GeneST array, and these matched probesets were used for further analysis. After selecting the best matching probesets between two platforms, principal component analysis was performed to project samples onto the first two principal components (
There were 291 and 4039 genes differentially expressed between ASD and controls in the P1 and P2 datasets, respectively (Welch's t-test P<0.001, corresponding FDRs 0.029 (P1), and 0.0023 (P2)). Of these, 67 genes were significant in both cohorts, as set forth in Table 8. Three genes were randomly selected from the differentially expressed genes in the P1 dataset, and their expression changes were validated using quantitative RT-PCR in the P2 and additional samples (total N=165 for ASD and N=103 for controls) (Table 5). All 3 genes, LRRC6, SULF2, and YES1, were significantly up-regulated. When each diagnostic subtype was compared to controls in the P1 dataset, 100, 43, and 9 genes (as set forth in Tables 9, 10, and 11, respectively) were significant for autistic disorder (AUT), pervasive developmental disorder not otherwise specified (PDDNOS), and Asperger's disorder (ASP), respectively (Welch's t-test P<0.001, corresponding FDRs 0.13 (AUT), 0.31 (PDDNOS), and 1.0 (ASP)). Among the significant genes in ASP, only one gene overlapped with AUT vs. control. None of the significant genes in ASP was differentially expressed in the patients with PDDNOS compared to controls. Interestingly, a larger number of genes were differentially expressed when the 9 ASP cases were excluded and the remaining ASD cases were compared with controls. A total of 395 genes were significant when the ASP samples were excluded, compared to 291 genes when the ASP samples were included, at the same statistical threshold (P<0.001, corresponding FDR 0.02 for 395 genes and 0.029 for 291 genes).
To determine which biological processes were implicated by the differentially expressed genes in ASD, an enrichment calculation was performed using a hypergeometric test. This metric allowed a determination of which processes were overrepresented in the 395 most differentially expressed genes when the ASP samples were excluded (P<0.001, corresponding FDR 0.02) relative to all the processes annotated in the Kyoto Encyclopedia of Genes and Genomes (KEGG). These results are enumerated in Table 3. In this experiment, the Neurotrophin signaling pathway (KEGG pathway identifier: hsa04722) was the most significant (hypergeometric test P=0.0011, FDR 0.012) among 14 overrepresented pathways (hypergeometric test P<0.05, corresponding FDR 0.39). The Neurotrophin signaling pathway includes neurotrophins and their second messenger systems such as the MAPK pathway, PI3K pathway, and PLC pathway, which have been identified by others as important for neural development, learning and memory, and syndromic ASD such as tuberous sclerosis and Smith-Lemli-Opitz syndrome. The second most significant pathway in this experiment was the Long-term potentiation pathway (hypergeometric test P=0.0029, FDR 0.032).
Peripheral blood gene expression profiles may be used as a molecular diagnostic tool for distinguishing ASD from controls. A repeated leave-group out cross-validation (LGOCV) strategy was used with P1 to build prediction models. The training set, which consisted of the P1 cohort, was utilized to determine a classification signature (the combination of gene expression measurements) that was used to classify ASD patients in P1 (compared to controls). Genes were ranked according to p-values from the AUT+PDDNOS vs. controls comparison in P1, since the differentially expressed genes were more prominent when AUT and PDDNOS samples were compared to controls without the ASP samples. This signature was then tested against the samples in an independent validation cohort (P2). The top N differentially expressed genes (where N ranges from 5 to 395 by 5) were used to build prediction models using a repeated 5-fold LGOCV with a partial least squares (PLS) method, and root mean squared errors (RMSE) were calculated (see Example 1). Mean RMSEs improved gradually when the number of genes was increased to build more complex prediction models; however, the prediction model that used the top 85 genes performed significantly better than the 80 gene model (t-test P=3.59×10−16) (
The accuracy of this 85-gene set (hereafter referred to as ASD85) within P1 was relatively high (area under the receiver operating characteristic curve (AUC) 0.96, 95% confidence interval (CI), 0.930-0.996), and the set also performed well when applied to the P2 validation population (AUC 0.73, 95% CI 0.654-0.799) (Table 2). When generating a set of genes to classify samples, a tradeoff between specificity and sensitivity may be considered to achieve optimal results as shown by the Receiver Operating Characteristic curves in
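The hypergeometric enrichment calculation can be sketched as follows. The p-value is the probability of observing at least the given number of pathway genes in the selected gene list when genes are drawn at random without replacement from the annotated background. The function and parameter names are hypothetical, and the example counts are not taken from Table 3.

```python
from math import comb


def hypergeom_enrichment_p(population, pathway_size, selected, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    population   -- number of annotated genes in the background
    pathway_size -- number of background genes in the pathway
    selected     -- number of genes in the differentially expressed list
    overlap      -- number of selected genes that are in the pathway
    """
    upper = min(pathway_size, selected)
    total = comb(population, selected)
    tail = sum(comb(pathway_size, k) *
               comb(population - pathway_size, selected - k)
               for k in range(overlap, upper + 1))
    return tail / total


# Small hypothetical example: 10 background genes, 4 in the pathway,
# 3 selected genes of which 2 fall in the pathway.
print(hypergeom_enrichment_p(10, 4, 3, 2))
```

The resulting pathway p-values can then be corrected for multiple comparisons in the same way as the gene-level p-values.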
The receiver operating characteristic (ROC) curve analysis was performed to evaluate the prediction accuracy (
In assessing robustness of the predictor for ASD classification, the expression data for potential confounders was evaluated. Among the demographic and clinical features, age at the time of blood draw may significantly influence gene expression. Within the ASD group, age at blood collection was correlated with the expression of 389 genes at a significance level of P<0.001 (Spearman's rank correlation test, N=66, corresponding FDR 0.018). The one carbon pool by folate pathway (KEGG ID: hsa00670) was significantly enriched with the 389 age-correlated genes in the ASD population (hypergeometric test P=6.7×10−7, FDR 7.7×10−4). The age-correlated genes in this pathway were MTHFD1, TYMS, SHMT2, ATIC, MTHFD1L, and GART. The ASD85 genes were not significantly correlated with age except for CEP110, CREBZF, C10orf28, and UTY across the patients with ASD. In the P1 control group (N=33), 163 genes correlated significantly with age, but none of the ASD85 genes were among them.
Several other clinical and developmental characteristics were also correlated with gene expression changes as summarized in Table 4. The positive history of developmental delay including a delay in hitting milestones such as sitting, crawling, walking, and speaking was associated with 11 genes including ARX. The aristaless related homeobox (ARX) is a homeodomain transcription factor that plays roles in cerebral development and patterning, and is implicated in X-linked mental retardations. ARX was not differentially expressed in the ASD group of P1 (P=0.64); however, it was significantly down-regulated in the individuals with positive history of developmental delay (P=0.00037, FDR 0.31).
In the P1 cohort, 9 patients with ASD were diagnosed with learning disorders. Sixty-four genes were differentially expressed with regard to learning disorders (Positive History N=9, Negative History N=90, P<0.001, corresponding FDR 0.14). The calcium signaling pathway (KEGG ID: hsa04020) was significant (hypergeometric P=0.023, FDR 0.19) with ADRA1B, CHRM2, PPP3R1, and P2RX3. The Synapsin 2 (SYN2), one of the 64 differentially expressed genes in the patients co-diagnosed with learning disorders, is a synaptic vesicle-associated protein that has been implicated in modulation of neurotransmitter release and in synaptogenesis. A brain gene expression study showed that SYN2 was down-regulated in the prefrontal cortex of schizophrenic patients. The differentially expressed genes that were correlated with other clinical conditions including psychiatric, neurological, gastrointestinal disorders, and seizure disorder are summarized in Table 4.
This example demonstrates, among other things, the usefulness of gene expression profiling to distinguish ASD patients from control samples, with an average accuracy of 72.5% in one population (the P1 cohort) and greater than 72.7% in an independently collected validation population (P2).
The performance of the classification in this example is notable in part because the two groups were relatively heterogeneous and were profiled using two different array-types. The classification of 73% of cases by expression profiling contrasts with the small percentage of ASD cases characterized through genetic mutations or structural variations to date. It also compares favorably to the performance of CMA, which accounts for 7-10% of cases of ASD. Together, these results indicate that gene expression signatures, which comprise multiple perturbed pathways, may serve as signals of genetic change in many patients. Moreover, in some embodiments, peripheral blood cells may be used as a surrogate for gene expression in the developing nervous system.
The biological processes implicated by the differentially expressed genes identified in this example are of interest in part because some of the pathways link to synaptic activity-dependent processes (i.e., Long-Term Potentiation and Neurotrophin signaling pathway in Table 3), for which several ASD mutations have been found. Immune/inflammation pathways were also identified in this analysis (e.g. Chemokine signaling pathway and Fc gamma R-mediated phagocytosis).
CREBBP, RPS6KA3, and NIPBL are associated with mental retardation. Heterozygous mutation of CREBBP is indicated in Rubinstein-Taybi syndrome, of which the core symptom is mental retardation (MIM ID #180849). Coffin-Lowry syndrome (MIM ID #303600) is associated with mutations in RPS6KA3 on chromosome Xp22.12, and is characterized by skeletal malformation, growth retardation, cognitive impairments, hearing deficit, and paroxysmal movement disorders. Mutations in NIPBL result in Cornelia de Lange syndrome (MIM ID #122470), a disorder characterized by dysmorphic facial features, growth delay, limb reduction defects as well as mental retardation.
Moreover, DOCK8 is significantly differentially expressed in ASD (P=3.05×10−4). Two unrelated patients possessed heterozygous disruptions of the DOCK8 gene, one by deletion and one by a translocation breakpoint; these disruptions are associated with mental retardation and developmental disability (MRD2, MIM ID #614113). In the P2 dataset, 13 differentially expressed genes were associated with mental retardation. These were ATP6AP2, ATRX, CRBN, FXR1, IGF1, INPPSE, KIAA2022, NUFIP2, RPS6KA3, TECT, UBSE2A, and ZDHHC9. RPS6KA3 was significant in both P1 and the male samples in the P2 dataset. Four out of 66 ASD cases in the P1 dataset had mild mental retardation. The comparison of the 4 cases with mild mental retardation against the other 62 ASD cases in P1 found 95 differentially expressed genes (P<0.001, corresponding FDR 0.09).
The differentially expressed genes in the patients with ASP were distinct from the ones in AUT vs. controls or PDDNOS vs. controls. In one embodiment, more genes were differentially expressed without ASP samples compared to with ASP at the same statistical stringency. Since the median age was older for ASP group (9.2, range 4-16) compared to AUT+PDDNOS (6.8, range 3.4-17.5), differential expression was evaluated to determine if it was confounded by age. The expression of PNOC, one of the differentially expressed genes in ASP vs. controls, was correlated with age in the P1 (P=6.42E-05). However, the other significant genes in ASP were not correlated with age in this example.
Expression profiling also identified chromosomal abnormalities. For instance, an affected male that had high expression of the X-inactive-specific transcript (XIST) was identified; the expression values were comparable to those of females. Subsequent karyotyping confirmed Klinefelter syndrome in this individual, and the case was excluded from further analysis in this study.
In this example, two data sets were obtained at different times and the methods for RNA acquisition and microarrays used in P1 differed in part from those in P2. Also, the control population in P2 versus P1 differed in the clinics from which they were drawn, and the race and ethnic backgrounds of the patients and control population were not completely matched. Nonetheless, analysis of the independent datasets demonstrates the accuracy of the classifier. Also, the accuracy obtained in this example demonstrates that the geneset used includes predictive biomarkers.
This example provides the results of a blood transcriptome analysis that aims to identify differences between 170 ASD patients and 115 age/sex-matched controls and to evaluate the utility of gene expression profiling as a tool to aid in the diagnosis of ASD. Differentially expressed genes were enriched for the neurotrophin signaling, long-term potentiation/depression, and notch signaling pathways, among other pathways. A 55-gene prediction model was developed, using a cross-validation strategy, on a sample cohort of 66 male ASD and 33 age-matched male controls (referred to in Example 3 as P1*). Subsequently, 104 ASD and 82 controls were recruited and used as a validation set (referred to in Example 3 as P2*). This 55-gene expression signature achieved 68% classification accuracy with the validation cohort (area under the receiver operating characteristic curve (AUC): 0.70 [95% confidence interval [CI]: 0.62-0.77]). The prediction model was built and trained with male samples and performed well for males (AUC 0.73, 95% CI 0.65-0.82). The prediction model when applied to female samples had the following performance characteristics: AUC 0.51, 95% CI 0.36-0.67. The 55-gene signature also performed robustly when the prediction model was trained with P2* male samples to classify P1* samples (AUC 0.69, 95% CI 0.58-0.80). The results, which are outlined in Tables 12-24, indicate feasibility of the use of blood expression profiling for ASD detection. Table 18 outlines the differentially expressed genes in the P1* data set. Table 19 outlines the differentially expressed genes in the P2* data set. Table 20 outlines the top 6 clusters of Gene Ontology biological process terms enriched for differentially expressed genes in the P1* data set. Table 21 outlines the 55 predictor genes. Table 22 outlines the prediction performances of ASD55 using various machine learning algorithms. Table 23 outlines the functional enrichment of genes in ASD55. Table 24 outlines pathways enriched with age-correlated genes.
Expression studies were performed by microarray profiling using an earlier version of the Affymetrix array (U133p2) for the P1* data set and a later version (GeneST) for the P2* data set. After selecting the best matching probesets between the two platforms, principal component analysis was performed to project samples into the first two principal components. P1* and P2* samples did not form two clusters after combining the two datasets, which were centered and scaled independently.
There were 489 and 610 transcripts differentially expressed between ASD and controls in the P1* and P2* datasets, respectively (Welch's t-test P<0.001, corresponding FDRs 0.029 (P1*), and 0.023 (P2*)) (Tables 12 and 13). Twenty-three genes (ARID4B, ARMCX3, C10orf28, CTBP2, DDX3Y, JRKL, MTERFD3, NFYA, NGEF, PNN, RLF, RNF145, TIGD1, TUBB2A, UTY, YES1, ZNF117, ZNF322, ZNF445, ZNF514, ZNF518B, ZNF540, and ZNF763) were significant in both cohorts. To calculate the significance of this overlap, sample labels were shuffled in both data sets 200,000 times, and the number of permutations with as many or more overlapping genes was counted. Out of 200,000 permutations, only 2 had at least 23 overlapping genes between the two data sets, yielding a permutation P=10−5. The overlap of 23 genes also showed a trend toward significance using the hypergeometric distribution (P=0.0721). In the P2* dataset, 352 genes were significant for male patients compared to male controls while 48 genes were significant for the female groups (Welch's t-test P<0.001, corresponding FDRs 0.028 (P2* males) and 0.60 (P2* females)). POLR3H was differentially expressed in both males and females.
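The permutation P value reported above is the fraction of label shufflings whose overlap is at least as large as the observed one. The following sketch shows that calculation under stated assumptions: `overlap_under_shuffle` is a hypothetical callable standing in for the full procedure (shuffle labels in both data sets, rerun the differential expression test, count shared genes), and `toy_overlap` is a simplified stand-in that merely draws two random gene subsets.

```python
import random

def overlap_permutation_p(observed_overlap, overlap_under_shuffle,
                          n_permutations, seed=0):
    """Fraction of permutations with overlap >= the observed overlap."""
    rng = random.Random(seed)
    extreme = sum(
        1
        for _ in range(n_permutations)
        if overlap_under_shuffle(rng) >= observed_overlap
    )
    return extreme / n_permutations

def toy_overlap(rng, universe=1000, n1=50, n2=60):
    # Stand-in null model (an assumption, not the label-shuffling procedure):
    # two random gene subsets drawn from a toy universe of gene indices.
    a = set(rng.sample(range(universe), n1))
    b = set(rng.sample(range(universe), n2))
    return len(a & b)

print(overlap_permutation_p(10, toy_overlap, n_permutations=2000))

# Arithmetic for the figures stated in the text:
# 2 of 200,000 permutations reached an overlap of at least 23 genes.
print(2 / 200_000)  # 1e-05
```

Note that this sketch divides the count directly by the number of permutations, which matches the stated value of 10−5; some analyses instead add one to both numerator and denominator.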
Twelve of the 489 differentially expressed genes in the P1* dataset were selected for validation by quantitative RT-PCR. The 12 genes had an average fold change between ASD and controls greater than 1.5 and a mean expression level on the array greater than 150. These were CREBZF, HNRNPA2B1, KIDINS220, LBR, MED23, RBBP6, SPATA13, SULF2, TMEM30A, ZDHHC17, ZMAT1, and ZNF12. Eleven genes were validated using qRT-PCR (Table 13).
For immune response and synaptic gene sets, robust Mahalanobis distances (RDs) were calculated for all P1* samples. (
Receiver operating characteristic (ROC) curve analysis was performed to evaluate the prediction accuracy as seen in
In
A global gene expression profile of the Training set (P1*) and the Validation set (P2*) samples is depicted in
The prediction model selection procedure, shown in
PTPRE was found in common for each diagnostic subgroup vs. control (
When each diagnostic subtype was compared to controls in the P1* dataset, 178, 56, and 3 genes were significant for autistic disorder (AUT), pervasive developmental disorder—not otherwise specified (PDDNOS), and Asperger's disorder (ASP), respectively (One-way analysis of variance (ANOVA) with Dunnett's post hoc test P<0.001, corresponding FDRs 0.076 (AUT), 0.24 (PDDNOS), and 1.0 (ASP)). Among the genes identified as significant in ASP, PTPRE overlapped with the AUT vs. control or PDDNOS vs. control comparisons, while 36 genes were in common between AUT vs. control and PDDNOS vs. control (
Four of 66 ASD cases in the P1* dataset had mild mental retardation. When the 4 ASD cases with mild mental retardation were compared to the 62 ASD cases without mental retardation, 70 differentially expressed genes (P<0.001, corresponding FDR 0.12) were found
Expression profiling also identified chromosomal abnormalities. For instance, an affected male that had high expression of the X-inactive-specific transcript (XIST) was identified; the expression values were comparable to those of females. Subsequent karyotyping confirmed Klinefelter syndrome in this individual, and the case was excluded in this study for further analysis.
A modified Fisher's exact test (i.e., Expression Analysis Systematic Explorer [EASE] score) was used to determine what biological pathways were enriched with the differentially expressed genes in P1* using the DAVID functional annotation system. This metric allowed for the calculation of which processes were overrepresented in the 489 differentially expressed genes in P1* relative to all the processes annotated in the Kyoto Encyclopedia of Genes and Genomes (KEGG). These results are detailed in Table 15. In brief, the neurotrophin signaling pathway (KEGG pathway identifier: hsa04722) was the most significant (EASE score P=0.00023, FDR 0.0026) among 22 overrepresented pathways (EASE score P<0.05, corresponding FDR 0.44). The neurotrophin signaling pathway includes neurotrophins and their second messenger systems such as the MAPK pathway, PI3K pathway, and PLC pathway. Interestingly, long-term potentiation and long-term depression pathways were also significant (EASE score P=0.011, FDR 0.11, and P=0.042, FDR 0.39 respectively). The 22 overrepresented pathways were grouped according to the number of shared genes by calculating Cohen's kappa score. Two enriched clusters of 15 and 3 pathways were significant (Cohen's kappa>0.5) with progesterone-mediated oocyte maturation belonging to both clusters. Five other pathways—notch signaling pathway, lysosome, leukocyte transendothelial migration, endocytosis, and MAPK signaling pathway—were not clustered with the others (Table 15).
Given that multiple pathways were significantly enriched with the differentially expressed genes, the heterogeneity of perturbation was investigated across samples. All the significant genes in the top 14 pathways, from neurotrophin signaling to the VEGF pathway (Table 15), were grouped together as pathway cluster 1. A majority of these genes were associated with immune response. The genes in the long-term potentiation and long-term depression pathways were grouped as pathway cluster 2. In this cluster, synaptic genes were enriched. When the samples were plotted in a multidimensional space corresponding to the two pathway clusters (
To test whether peripheral blood gene expression profiles could be used as a molecular diagnostic tool for identifying ASD, a repeated leave-group out cross-validation (LGOCV) strategy was used with P1* to build a prediction model. First, the training set (P1*) was utilized to determine a classification signature (i.e., a combination of gene expression measurements) that was used to classify ASD patients in P1* (compared to controls). Next, the 489 differentially expressed genes were ranked according to their area under the receiver operating characteristic (ROC) curve (AUC). Next, those genes with low expression were excluded, requiring the minimum expression level across all samples to be at least 150. A total of 391 differentially expressed genes were then utilized in building the prediction models, which were subsequently tested against the samples in the independent validation cohort (P2*). The top N genes (where N ranges from 10 to 390 in increments of 5) were used to build prediction models using a repeated 5-fold LGOCV with a partial least squares (PLS) method, and AUCs were calculated for each cross-validation instance (see Methods). The prediction model using the top 55 genes was the most stable across 100 repeats of LGOCV, having the smallest coefficient of variation in AUCs from the 100 trials. The top 55 genes performed significantly better than the 50-gene model (one-sided t-test P=0.00031). The 55-gene prediction model was chosen because it minimized description length (i.e., the number of predictor genes) while maintaining good prediction performance, and it was used to evaluate the independent dataset, P2*. The 55 significant genes are listed in Table 21. The performance of PLS was comparable to that of other prediction algorithms (Table 22); thus the classification performance was not attributable to a specific prediction algorithm.
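The stability criterion described above, i.e., choosing the model size whose cross-validated AUCs have the smallest coefficient of variation, can be sketched as follows. The AUC values are hypothetical and are not taken from this example; the function names are illustrative, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

def most_stable_model(aucs_by_size):
    """Return the model size whose repeated-CV AUCs vary the least.

    aucs_by_size maps number_of_genes -> list of AUCs, one per repeat.
    """
    return min(
        aucs_by_size,
        key=lambda n: coefficient_of_variation(aucs_by_size[n]),
    )

# Hypothetical AUCs from four cross-validation repeats per model size.
toy_aucs = {
    50: [0.90, 0.84, 0.95, 0.88],
    55: [0.93, 0.92, 0.94, 0.93],
    60: [0.94, 0.89, 0.96, 0.91],
}
print(most_stable_model(toy_aucs))  # 55
```

In practice each list would hold the AUCs from the 100 repeated LGOCV trials for one value of N.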
The accuracy of this 55-gene set (also referred to as ASD55) within P1* was relatively high which is consistent with P1* being the training set (AUC 0.98, 95% confidence interval (CI), 0.965-1.000), but ASD55 also had good performance when applied to the P2* validation population (AUC 0.70, 95% CI 0.623-0.773) (Table 16). When generating a set of genes to classify samples, a tradeoff between specificity and sensitivity must be considered to achieve optimal results as shown by the ROC curves in
Overall, the ASD55 predictor genes were enriched with 2 KEGG pathways (TGF-beta signaling pathway and Neurotrophin signaling pathway) and 8 Gene Ontology biological process terms (Table 23). 29 out of 55 predictor genes were associated with expression in the brain according to enrichment analysis using DAVID on UniProt tissue expression categories (UP_TISSUE, EASE score P=0.071, FDR 53.88). Also, hierarchical clustering of samples in P1* by the ASD55 predictor genes showed a clear distinction between patients and controls (
In order to ensure that the predictor was robust for ASD classification, the expression data for potential confounders was reviewed. Among the demographic and clinical features, age at time of blood draw significantly influenced gene expression. Within the ASD group, age at blood collection was correlated with the expression of 382 genes at a significance level of P<0.001 (Spearman's rank correlation test, N=66, corresponding FDR 0.018). Six KEGG pathways were significantly enriched with the 382 age-correlated genes in the P1* ASD population (Table 24). The one carbon pool by folate pathway (KEGG ID: hsa00670) was the most significantly enriched with age-correlated genes (EASE score P=4.6×10−7, FDR 5.2×10−4). The age-correlated genes in this pathway were MTHFD1, TYMS, SHMT2, ATIC, DHFR, MTHFD1L, and GART. The ASD55 genes were not significantly correlated with age except for CNTRL and UTY, which were correlated with age in patients but not controls. UTY was one of the 23 genes that were differentially expressed in both datasets (P1* and P2*). In the P1* control group (N=33), 163 genes correlated significantly with age, but none of the ASD55 genes were among them.
Several other clinical and developmental characteristics were also correlated with gene expression changes as summarized in Table 17. A positive personal history of developmental delay including a delay in hitting milestones such as sitting, crawling, walking, and speaking was associated with 12 genes including the aristaless related homeobox gene (ARX). ARX is a homeodomain transcription factor that plays crucial roles in cerebral development and patterning, and is implicated in X-linked mental retardations. ARX was not identified as being differentially expressed in the ASD group of P1 (P=0.74); however, it was significantly down-regulated in the individuals with positive history of developmental delay (P=0.00037, FDR 0.30).
In the P1* cohort, 9 patients with ASD were diagnosed with learning disorders. Sixty-four genes were differentially expressed with regard to learning disorders (Positive History N=9, Negative History N=90, P<0.001, corresponding FDR 0.14). The calcium signaling pathway (KEGG ID: hsa04020) was significant (hypergeometric P=0.023, FDR 0.19) due to ADRA1B, CHRM2, PPP3R1, and P2RX3. Another gene differentially expressed in patients with learning disorders, Synapsin 2 (SYN2), is a synaptic vesicle-associated protein. The differentially expressed genes that were correlated with other clinical conditions including psychiatric, neurological, gastrointestinal disorders, and seizure disorder are summarized in Table 17.
Gene expression levels were calculated using Affymetrix Power Tools version 1.10 (Affymetrix, CA). The Probe Log Iterative ERror (PLIER) algorithm, which includes a probe-level quantile normalization method, was applied to each microarray platform separately. To match the probeset identifiers from the two different platforms used in this study, a Best Match subset between the two was used. 29,129 out of 54,613 total probesets on U133p2 were best-matched to 17,984 unique probesets of the GeneST array, and these matched probesets were used for the cross-platform prediction analysis. For the genes represented by more than two U133p2 probesets, only the genes for which all probesets changed in the same direction were included.
To identify hidden confounders such as batch effects, surrogate variable analysis (SVA) was performed with a null model for batch effect. For the P1* dataset, SVA found 6 surrogate variables in the residuals after fitting with the primary variable of interest, i.e., clinical diagnosis. The first surrogate variable correlated significantly with the year in which the microarray profiling was performed. In the P2* dataset, a batch of 12 samples grouped separately from the other 172 samples in a principal component analysis, although none of the surrogate variables was correlated with the 12 outlier samples. The ComBat algorithm was used to reduce the batch effects in P1* and P2* independently because the two array platforms differ in probe design: the U133p2 array uses both perfect match (PM) and mismatch (MM) probes, while the GeneST array has only PM probes. All statistical analyses were performed with the ComBat-corrected expression data.
To identify differentially expressed genes in cases compared to controls, two tests were used: Welch's t-test for two-group comparisons, and a one-way analysis of variance with Dunnett's post hoc tests to find significantly changed genes in AUT, PDDNOS, or ASP compared to the control group. To identify differentially expressed genes in the P2* dataset, the significance of diagnosis and gender was determined by two-way analysis of variance, with follow-up Welch's t-tests for each gender and Dunnett's post hoc tests for subtypes. The threshold for differential expression was set at a nominal p-value <0.001. A general linear model was used to evaluate the significance of diagnosis, gender, age, and the other covariates. p-values were corrected for multiple comparisons by calculating a false discovery rate (FDR). Fisher's exact test was used for categorical data. Spearman's rank correlation coefficients were calculated to evaluate the correlation between continuous phenotypic variables, such as age at blood drawing, and the expression level of each gene. The significance of correlation was determined using Fisher's r-to-z transformation. Biological pathways enriched with predictor genes were found using the DAVID functional annotation system. For significant KEGG pathways, the robust Mahalanobis distance of each individual was calculated from the common centroid of all cases and controls to find outliers using the minimum covariance determinant estimator. A quantile of the Chi-squared distribution (e.g., the 97.5% quantile) was used as a cut-off to define outliers, because for multivariate normally distributed data the squared Mahalanobis distance values are approximately chi-squared distributed. These outliers can be interpreted as biologically distinct subgroups for each pathway. Statistical analyses were performed using the R statistical programming language, and robust multivariate outlier analysis was performed using the chemometrics R library package.
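As an illustration of the correlation significance step, the following sketch applies Fisher's r-to-z transformation to a correlation coefficient. It assumes the usual standard error of 1/sqrt(n−3) and a two-sided normal approximation; whether any adjustment specific to rank correlations was applied is not stated, and the coefficient in the usage line is hypothetical.

```python
from math import atanh, erfc, sqrt

def fisher_z_p_value(r, n):
    """Two-sided p-value for a correlation r observed in n samples.

    Uses z = atanh(r) * sqrt(n - 3) and a standard normal reference
    distribution (an assumption of this sketch).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = atanh(r) * sqrt(n - 3)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return erfc(abs(z) / sqrt(2))

# Hypothetical correlation between age and one gene's expression, N=66.
print(f"{fisher_z_p_value(0.45, 66):.3g}")
```

A gene would be counted as age-correlated when this p-value is below the nominal threshold of 0.001.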
Prediction analysis was performed in the following sequential steps; 1) ranking genes for predictor selection, 2) setting up a cross-validation strategy in the training set, 3) tuning parameters and building prediction models, and 4) predicting a test set, and evaluating prediction performances (
For each sample in a test set, the model predicts the probability of being classified as ASD. Thus, the number of false positives among positive predictions changes with the threshold. Overall prediction accuracy was calculated as (the number of true positives+the number of true negatives)/N, where N was the total number of samples in a dataset. Sensitivity, specificity, positive predictive value, and negative predictive value were presented as standard measures of prediction performance with AUC. The ROC curve summarizes the result at different thresholds.
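The performance measures named above can be computed from predicted probabilities and true labels as in the following sketch. The threshold of 0.5, the toy probabilities, and the function names are assumptions made for illustration; the AUC is computed by the pairwise (rank-based) definition, with ties counted as one half.

```python
def _ratio(numerator, denominator):
    # Return NaN rather than raising when a measure is undefined.
    return numerator / denominator if denominator else float("nan")

def confusion_metrics(probabilities, labels, threshold=0.5):
    """Accuracy, sensitivity, specificity, PPV and NPV at one threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, labels):
        predicted_asd = p >= threshold
        if predicted_asd and y:
            tp += 1
        elif predicted_asd and not y:
            fp += 1
        elif not predicted_asd and y:
            fn += 1
        else:
            tn += 1
    n = tp + fp + tn + fn
    return {
        "accuracy": _ratio(tp + tn, n),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }

def auc(probabilities, labels):
    """Probability that a random case scores higher than a random control."""
    cases = [p for p, y in zip(probabilities, labels) if y]
    controls = [p for p, y in zip(probabilities, labels) if not y]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

# Toy predictions: 1 = ASD, 0 = control (hypothetical values).
toy_probs = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
print(confusion_metrics(toy_probs, toy_labels))
print(auc(toy_probs, toy_labels))
```

Lowering the threshold raises sensitivity at the cost of specificity, which is the tradeoff the ROC curve summarizes across all thresholds.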
To find a high-performing prediction model with a minimum description length, AUCs were compared between prediction models built using the top N genes. The mean AUCs improved gradually with increasing model complexity. However, it was also possible to identify the most stable prediction model by calculating the coefficient of variation of AUCs over 100 trials of outer cross-validation. Five additional prediction methods were tested using 55 genes with the 5-fold LGOCV strategy: Logistic regression, Naïve Bayes, k-Nearest Neighbors, Random Forest, and Support Vector Machine. Statistical prediction analysis was performed using the caret and RWeka R library packages.
A total of 12 genes were assayed in 30 ASD and 30 control samples (60 samples in total) from the P1 population, run in replicates of four on the Biomark real time PCR system (Fluidigm, CA) using nanoliter reactions and the Taqman system (Applied Biosystems, CA). Following the Biomark protocol, quantitative RT-PCR (qRT-PCR) amplifications were carried out in a 9 nanoliter reaction volume containing 2× Universal Master Mix (Taqman), Taqman gene expression assays, and preamplified cDNA. Pre-amplification reactions were done in a PTC-200 thermal cycler from MJ Research, per the Biomark protocol. Reactions and analysis were performed using a Biomark system. The cycling program consisted of an initial cycle of 50° C. for 2 minutes and a 10 minute incubation at 95° C. followed by 40 cycles of 95° C. for 15 seconds, 70° C. for 5 seconds, and 60° C. for 1 minute. Data were normalized to the housekeeping gene GAPDH, and expressed relative to control. All primers used for the 12 genes are listed in Table 13.
The meanings of certain abbreviations used in the specification are provided below.
ASD—autism spectrum disorders
ROC—receiver operating characteristic
AUC—area under the receiver operating characteristic curve
CI—confidence interval
DSM-IV-TR—Diagnostic and Statistical Manual of Mental Disorders, 4th edition, Text Revision
ADI-R—autism diagnostic interview-revised
ADOS—autism diagnostic observation schedule
CMA—chromosomal microarray analysis
U133p2—Affymetrix HG-U133 Plus 2.0 array
GeneST—Affymetrix Human Gene 1.0 ST array
FDR—false discovery rate
qRT-PCR—quantitative realtime polymerase chain reaction
PLS—partial least squares
ASD245—a prediction model with 245 genes
This invention is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced or of being carried out in various ways. Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having,” “containing,” “involving,” and variations thereof herein, is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.
Having thus described several aspects of at least one embodiment of this invention, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be part of this disclosure, and are intended to be within the spirit and scope of the invention. Accordingly, the foregoing description and drawings are by way of example only.
This application claims priority under 35 U.S.C. §119 to U.S. provisional patent application, U.S. Ser. No. 61/553,914, filed Oct. 31, 2011, entitled “Methods and Compositions for Characterizing Autism Spectrum Disorder Based on Gene Expression Patterns,” and U.S. provisional patent application, U.S. Ser. No. 61/710,646, filed Oct. 5, 2012, entitled “Methods and Compositions for Characterizing Autism Spectrum Disorder Based on Gene Expression Patterns,” the entire contents of which are incorporated herein by reference.
This invention was made with United States Government support under grants R01MH085143 and P30HD018655 awarded by, respectively, the National Institute of Mental Health and the National Institute of Child Health & Human Development of the National Institutes of Health. The United States government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US2012/062735 | 10/31/2012 | WO | 00 | 4/29/2014 |
Number | Date | Country | |
---|---|---|---|
61710646 | Oct 2012 | US | |
61553914 | Oct 2011 | US |