The instant application contains a Sequence Listing which has been filed electronically in ASCII format and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Jan. 24, 2019, is named 05934_ST25.txt and is 1100 bytes in size.
The instant application contains four data tables which have been filed electronically and each table is hereby incorporated by reference in its entirety. The four data tables were created on Jan. 28, 2019, and are named as follows (with size in parentheticals): E_Data_Table_1.txt (70 KB), E_Data_Table_2.txt (16 KB), E_Data_Table_3.txt (13 MB), and E_Data_Table_4.txt (1 MB).
The invention is generally directed to methods and processes for genetic data evaluation, and more specifically to methods and systems utilizing genetic data involving multifactorial traits and/or disorders and applications thereof.
Within a typical mammalian genome, the coding DNA (i.e., DNA gene sequences that encode proteins) makes up a very small portion. For example, approximately 2% of the human genome contains sequence that encodes protein. The rest of the genome is noncoding DNA.
Noncoding DNA has long been thought to be nonfunctional and is often referred to as “junk” DNA. It is now understood, however, that noncoding DNA does in fact have several functions. These functions include encoding various noncoding RNA (e.g., transfer RNA, ribosomal RNA, snoRNA) and regulating gene function. Noncoding DNA can regulate gene transcription and translation by recruiting various transcriptional and posttranscriptional regulatory factors to a gene via various sequence elements. Transcriptional sequence elements include transcription factor binding sites, operators, enhancers, silencers, promoters, transcriptional start sites, and insulators. Posttranscriptional sequence elements include RNA binding protein (RBP) sites, splice acceptors, splice donors, and cis-acting sequence elements.
Several embodiments are directed to methods and processes to evaluate variants that affect biochemical regulation.
In an embodiment to treat an individual for a medical disorder, genetic material of an individual that includes a set of genomic loci is sequenced. Each locus of the set of genomic loci contains sequence that has been determined to harbor a pathogenic variant that affects at least one biochemical regulatory process. The effect of harboring a pathogenic variant within each genomic locus has been associated with the pathogenicity of a medical disorder as determined by the effects of the variant on the at least one biochemical regulatory process. A set of variants that reside within the set of genomic loci sequenced is identified. A trained computational model to determine pathogenicity of each variant of the set of variants identified is obtained. The pathogenicity of each variant is based upon an aggregation of the variant's effects upon the at least one biochemical regulatory process. The computational model is trained utilizing a set of known pathogenic variants and a set of null variants. Utilizing the trained computational model, a diagnosis of the individual is determined based upon a cumulative pathogenicity score of the individual. The diagnosis indicates a propensity for the medical disorder. The cumulative pathogenicity score is determined by aggregating pathogenicity of the individual's variants within the set of genomic loci. When the individual is determined to have a diagnosis indicating a propensity for the medical disorder, the individual is treated for the medical disorder.
In another embodiment, the effects of the variant on at least one biochemical regulatory process are determined by a second computational model that has been trained utilizing a set of features of a regulatory effect profile, and the regulatory effect profile is one of: a chromatin regulatory effect profile and an RNA binding protein (RBP) and RNA element profile.
In yet another embodiment, the second computational model is a deep neural network.
In a further embodiment, the second computational model is a convolutional neural network.
In still yet another embodiment, the regulatory profile is the chromatin regulatory effect profile, and wherein the set of features are cell-type specific.
In yet a further embodiment, the regulatory profile is the chromatin regulatory effect profile, and wherein the set of features include at least one of: sites of chromatin accessibility, chromatin marks, and transcription factor binding sites.
In an even further embodiment, the chromatin regulatory effect profile is determined utilizing at least one epigenetic assay selected from a group consisting of: chromatin immunoprecipitation sequencing (ChIP-seq), DNase I hypersensitivity sequencing (DNase-seq), Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq), Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE-seq), Hi-C capture sequencing, bisulfite sequencing (BS-seq), and a methyl array.
In yet an even further embodiment, the regulatory profile is the RBP and RNA element profile, and wherein the set of features are cell-type specific.
In still yet an even further embodiment, the regulatory profile is the RBP and RNA element profile, and wherein the set of features include RBP binding sites.
In still yet an even further embodiment, the RBP and RNA element profile is determined utilizing at least one RNA-binding assay selected from a group consisting of: cross-linking immunoprecipitation sequencing (CLIP-seq) and RNA immunoprecipitation sequencing (RIP-seq).
In still yet an even further embodiment, the genetic material is one of: a whole genome or a partial genome.
In still yet an even further embodiment, the genetic material is obtained from a biopsy of the individual.
In still yet an even further embodiment, the sequencing performed is one of: whole genome sequencing or capture sequencing.
In still yet an even further embodiment, the biochemical regulatory process is selected from a group consisting of: transcriptional regulation, posttranscriptional regulation, and translational regulation.
In still yet an even further embodiment, the identified set of variants include at least one de novo variant.
In still yet an even further embodiment, the identified set of variants include at least one inherited variant.
In still yet an even further embodiment, at least one locus of the set of genomic loci is determined based upon the pathogenicity results of applying the trained computational model to a set of variants that have been identified for a collection of individuals having been diagnosed for the medical disorder.
In still yet an even further embodiment, at least one locus of the set of genomic loci is identified experimentally to be associated with the medical disorder.
In still yet an even further embodiment, the computational model is a linear regression.
In still yet an even further embodiment, the linear regression model is L2 regularized.
In still yet an even further embodiment, the diagnosis is determined based upon a threshold, and wherein when the individual's cumulative pathogenicity score is above the threshold, the individual is determined to have a propensity for the medical disorder.
In still yet an even further embodiment, the medical disorder is a complex medical disorder.
In still yet an even further embodiment, the medical disorder is selected from a group consisting of: autism spectrum disorder, Alzheimer disease, arthritis, asthma, bipolar disorder, cancer, cleft lip and/or palate, coronary artery disease, Crohn's disease, dementia, depression, diabetes (type II), heart disease, heart failure, high cholesterol, hypertension, hypothyroidism, irritable bowel syndrome, obesity, osteoporosis, Parkinson disease, rhinitis, psoriasis, multiple sclerosis, schizophrenia, sleep apnea, spina bifida, and stroke.
In still yet an even further embodiment, the medical disorder is autism spectrum disorder and treating the individual comprises administering at least one of: behavioral therapy, communication therapy, educational therapy, and risperidone.
In still yet an even further embodiment, the set of known pathogenic variants is derived from the Human Gene Mutation Database.
In still yet an even further embodiment, the set of null variants is derived from at least one of: the International Genome Sample Resource (IGSR) 1000 Genomes project, a set of common variants with no expected pathogenicity, a set of variants randomly generated by in silico methods.
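By way of non-limiting illustration only, the aggregation and threshold steps of the foregoing embodiments may be sketched as follows. The sketch assumes that a pathogenicity has already been assigned to each variant by a trained computational model, and that aggregation is a simple sum; the text above does not fix the aggregation function. The names `ScoredVariant`, `cumulative_pathogenicity`, and `indicates_propensity`, as well as the loci, variants, and threshold values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredVariant:
    """A variant whose pathogenicity was already assigned by a trained model."""
    chrom: str
    pos: int              # 1-based position (illustrative convention)
    ref: str
    alt: str
    pathogenicity: float


def in_any_locus(variant, loci):
    """Return True if the variant falls inside one of the (chrom, start, end) loci."""
    return any(
        chrom == variant.chrom and start <= variant.pos <= end
        for chrom, start, end in loci
    )


def cumulative_pathogenicity(variants, loci):
    """Aggregate per-variant pathogenicity over the loci of interest.

    Summation is an assumption; the description only says scores are aggregated.
    """
    return sum(v.pathogenicity for v in variants if in_any_locus(v, loci))


def indicates_propensity(variants, loci, threshold):
    """Threshold rule: a cumulative score above the threshold indicates propensity."""
    return cumulative_pathogenicity(variants, loci) > threshold


# Hypothetical loci, variants, and threshold, for illustration only.
loci = [("chr1", 1000, 2000), ("chr2", 500, 900)]
variants = [
    ScoredVariant("chr1", 1500, "A", "G", 0.5),
    ScoredVariant("chr2", 600, "C", "T", 0.25),
    ScoredVariant("chr3", 42, "G", "A", 4.0),  # outside every locus, so ignored
]
score = cumulative_pathogenicity(variants, loci)
flagged = indicates_propensity(variants, loci, threshold=0.5)
```

In this sketch only variants residing within the sequenced loci contribute to the individual's score, consistent with the score being aggregated over the individual's variants within the set of genomic loci.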
In an embodiment to treat an individual for a medical disorder, genetic material of an individual that includes a set of genomic loci is sequenced. Each locus of the set of genomic loci contains sequence that has been determined to harbor a pathogenic variant that affects at least one biochemical regulatory process. The effect of harboring a pathogenic variant within each genomic locus has been associated with the pathogenicity of a medical disorder as determined by the effects of the variant on the at least one biochemical regulatory process. A set of variants that reside within the set of genomic loci sequenced is identified. A first trained computational model to determine the biochemical regulatory effects of the identified variants is obtained. The biochemical regulatory effects are one of: effects on transcriptional regulation or effects on posttranscriptional regulation. The first computational model is trained utilizing a set of features of a regulatory effect profile. The regulatory effect profile is one of: a chromatin regulatory effect profile and an RNA binding protein (RBP) and RNA element profile. The biochemical regulatory effect of each identified variant is determined. A second trained computational model to determine pathogenicity of each variant of the set of variants identified is obtained. The pathogenicity of each variant is based upon an aggregation of the variant's effects upon the at least one biochemical regulatory process. The second computational model is trained utilizing a set of known pathogenic variants and a set of null variants. Utilizing the second trained computational model, a diagnosis of the individual is determined based upon a cumulative pathogenicity score of the individual. The diagnosis indicates a propensity for the medical disorder. The cumulative pathogenicity score is determined by aggregating pathogenicity of the individual's variants within the set of genomic loci. When the individual is determined to have a diagnosis indicating a propensity for the medical disorder, the individual is treated for the medical disorder.
In another embodiment, the first computational model is a deep neural network.
In yet another embodiment, the first computational model is a convolutional neural network.
In a further embodiment, the regulatory profile is the chromatin regulatory effect profile, and wherein the set of features are cell-type specific.
In still yet another embodiment, the regulatory profile is the chromatin regulatory effect profile, and wherein the set of features include at least one of: sites of chromatin accessibility, chromatin marks, and transcription factor binding sites.
In yet a further embodiment, the chromatin regulatory effect profile is determined utilizing at least one epigenetic assay selected from a group consisting of: chromatin immunoprecipitation sequencing (ChIP-seq), DNase I hypersensitivity sequencing (DNase-seq), Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq), Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE-seq), Hi-C capture sequencing, bisulfite sequencing (BS-seq), and a methyl array.
In an even further embodiment,
In yet an even further embodiment, the regulatory profile is the RBP and RNA element profile, and wherein the set of features are cell-type specific.
In still yet an even further embodiment, the regulatory profile is the RBP and RNA element profile, and wherein the set of features include RBP binding sites.
In still yet an even further embodiment, the RBP and RNA element profile is determined utilizing at least one RNA-binding assay selected from a group consisting of: cross-linking immunoprecipitation sequencing (CLIP-seq) and RNA immunoprecipitation sequencing (RIP-seq).
In still yet an even further embodiment, the genetic material is one of: a whole genome or a partial genome.
In still yet an even further embodiment, the genetic material is obtained from a biopsy of the individual.
In still yet an even further embodiment, the sequencing performed is one of: whole genome sequencing or capture sequencing.
In still yet an even further embodiment, the biochemical regulatory process is selected from a group consisting of: transcriptional regulation, posttranscriptional regulation, and translational regulation.
In still yet an even further embodiment, the identified set of variants include at least one de novo variant.
In still yet an even further embodiment, the identified set of variants include at least one inherited variant.
In still yet an even further embodiment, at least one locus of the set of genomic loci is determined based upon the pathogenicity results of applying the second trained computational model to a set of variants that have been identified for a collection of individuals having been diagnosed for the medical disorder.
In still yet an even further embodiment, at least one locus of the set of genomic loci is identified experimentally to be associated with the medical disorder.
In still yet an even further embodiment, the second computational model is a linear regression.
In still yet an even further embodiment, the linear regression model is L2 regularized.
In still yet an even further embodiment, the diagnosis is determined based upon a threshold, and wherein when the individual's cumulative pathogenicity score is above the threshold, the individual is determined to have a propensity for the medical disorder.
In still yet an even further embodiment, the medical disorder is a complex medical disorder.
In still yet an even further embodiment, the medical disorder is selected from a group consisting of: autism spectrum disorder, Alzheimer disease, arthritis, asthma, bipolar disorder, cancer, cleft lip and/or palate, coronary artery disease, Crohn's disease, dementia, depression, diabetes (type II), heart disease, heart failure, high cholesterol, hypertension, hypothyroidism, irritable bowel syndrome, obesity, osteoporosis, Parkinson disease, rhinitis, psoriasis, multiple sclerosis, schizophrenia, sleep apnea, spina bifida, and stroke.
In still yet an even further embodiment, the medical disorder is autism spectrum disorder and treating the individual comprises administering at least one of: behavioral therapy, communication therapy, educational therapy, and risperidone.
In still yet an even further embodiment, the set of known pathogenic variants is derived from the Human Gene Mutation Database.
In still yet an even further embodiment, the set of null variants is derived from at least one of: the International Genome Sample Resource (IGSR) 1000 Genomes project, a set of common variants with no expected pathogenicity, a set of variants randomly generated by in silico methods.
In an embodiment of treating autism spectrum disorder, genetic material of an individual that includes a set of genomic loci is sequenced. Each locus of the set of genomic loci contains sequence that has been determined to harbor a pathogenic variant that affects at least one biochemical regulatory process. The effect of harboring a pathogenic variant within each genomic locus has been associated with the pathogenicity of autism spectrum disorder as determined by the effects of the variant on the at least one biochemical regulatory process. A set of variants that reside within the set of genomic loci sequenced is identified. A trained computational model to determine pathogenicity of each variant of the set of variants identified is obtained. The pathogenicity of each variant is based upon an aggregation of the variant's effects upon the at least one biochemical regulatory process. The computational model is trained utilizing a set of known pathogenic variants and a set of null variants. Utilizing the trained computational model, a diagnosis of the individual is determined based upon a cumulative pathogenicity score of the individual. The diagnosis indicates a propensity for autism spectrum disorder. The cumulative pathogenicity score is determined by aggregating pathogenicity of the individual's variants within the set of genomic loci. When the individual is determined to have a diagnosis indicating a propensity for autism spectrum disorder, the individual is treated for autism spectrum disorder.
In another embodiment, the effects of the variant on at least one biochemical regulatory process are determined by a second computational model that has been trained utilizing a set of features of a regulatory effect profile, and the regulatory effect profile is one of: a chromatin regulatory effect profile and an RNA binding protein (RBP) and RNA element profile.
In yet another embodiment, the second computational model is a deep neural network.
In a further embodiment, the second computational model is a convolutional neural network.
In still yet another embodiment, the regulatory profile is the chromatin regulatory effect profile, and wherein the set of features are cell-type specific.
In yet a further embodiment, the regulatory profile is the chromatin regulatory effect profile, and wherein the set of features include at least one of: sites of chromatin accessibility, chromatin marks, and transcription factor binding sites.
In an even further embodiment, the chromatin regulatory effect profile is determined utilizing at least one epigenetic assay selected from a group consisting of: chromatin immunoprecipitation sequencing (ChIP-seq), DNase I hypersensitivity sequencing (DNase-seq), Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq), Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE-seq), Hi-C capture sequencing, bisulfite sequencing (BS-seq), and a methyl array.
In yet an even further embodiment, the regulatory profile is the RBP and RNA element profile, and wherein the set of features are cell-type specific.
In still yet an even further embodiment, the regulatory profile is the RBP and RNA element profile, and wherein the set of features include RBP binding sites.
In still yet an even further embodiment, the RBP and RNA element profile is determined utilizing at least one RNA-binding assay selected from a group consisting of: cross-linking immunoprecipitation sequencing (CLIP-seq) and RNA immunoprecipitation sequencing (RIP-seq).
In still yet an even further embodiment, the genetic material is one of: a whole genome or a partial genome.
In still yet an even further embodiment, the genetic material is obtained from a biopsy of the individual.
In still yet an even further embodiment, the sequencing performed is one of: whole genome sequencing or capture sequencing.
In still yet an even further embodiment, the biochemical regulatory process is selected from a group consisting of: transcriptional regulation, posttranscriptional regulation, and translational regulation.
In still yet an even further embodiment, the identified set of variants include at least one de novo variant.
In still yet an even further embodiment, the identified set of variants include at least one inherited variant.
In still yet an even further embodiment, at least one locus of the set of genomic loci is determined based upon the pathogenicity results of applying the trained computational model to a set of variants that have been identified for a collection of individuals having been diagnosed for autism spectrum disorder.
In still yet an even further embodiment, at least one locus of the set of genomic loci is identified experimentally to be associated with autism spectrum disorder.
In still yet an even further embodiment, the computational model is a linear regression.
In still yet an even further embodiment, the linear regression model is L2 regularized.
In still yet an even further embodiment, the diagnosis is determined based upon a threshold, and wherein when the individual's cumulative pathogenicity score is above the threshold, the individual is determined to have a propensity for autism spectrum disorder.
In still yet an even further embodiment, treating the individual comprises administering at least one of: behavioral therapy, communication therapy, educational therapy, and risperidone.
In still yet an even further embodiment, behavioral therapy is administered and includes teaching the individual behavioral skills across different settings and reinforcing desirable characteristics.
In still yet an even further embodiment, communication therapy is administered and includes performing speech and language pathology to improve development of language and communication skills.
In still yet an even further embodiment, educational therapy is administered and includes enrolling the subject in special education classes.
In still yet an even further embodiment, the set of known pathogenic variants is derived from the Human Gene Mutation Database.
In still yet an even further embodiment, the set of null variants is derived from at least one of: the International Genome Sample Resource (IGSR) 1000 Genomes project, a set of common variants with no expected pathogenicity, a set of variants randomly generated by in silico methods.
In an embodiment for evaluating genetic data to determine biochemical regulatory effects of variants, using computer systems, a neural network computational model is trained to yield a composite of biochemical regulatory effects. The biochemical regulatory effects are one of: effects on transcriptional regulation or effects on posttranscriptional regulation. The neural network computational model is trained utilizing a set of features of a regulatory effect profile. The regulatory effect profile is one of: a chromatin regulatory effect profile and an RNA binding protein (RBP) and RNA element profile. Using computer systems, genetic data of a collection of individuals is obtained. Using computer systems, a set of variants is identified within the genetic data of the collection of individuals. Using computer systems and the trained neural network computational model, the biochemical regulatory effects of each variant of the set of variants are determined.
In another embodiment, the collection of individuals share a complex trait and each individual has been diagnosed as having the complex trait.
In yet another embodiment, the collection of individuals are unaffected and each individual has not been diagnosed as having the complex trait.
In a further embodiment, the neural network is a deep neural network.
In still yet another embodiment, the neural network is a convolutional neural network.
In yet a further embodiment, the regulatory profile is the chromatin regulatory effect profile, and wherein the set of features are cell-type specific.
In an even further embodiment, the regulatory profile is the chromatin regulatory effect profile, and wherein the set of features include at least one of: sites of chromatin accessibility, chromatin marks, and transcription factor binding sites.
In yet an even further embodiment, the chromatin regulatory effect profile is determined utilizing at least one epigenetic assay selected from a group consisting of: chromatin immunoprecipitation sequencing (ChIP-seq), DNase I hypersensitivity sequencing (DNase-seq), Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq), Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE-seq), Hi-C capture sequencing, bisulfite sequencing (BS-seq), and a methyl array.
In still yet an even further embodiment, the regulatory profile is the RBP and RNA element profile, and wherein the set of features are cell-type specific.
In still yet an even further embodiment, the regulatory profile is the RBP and RNA element profile, and wherein the set of features include RBP binding sites.
In still yet an even further embodiment, the RBP and RNA element profile is determined utilizing at least one RNA-binding assay selected from a group consisting of: cross-linking immunoprecipitation sequencing (CLIP-seq) and RNA immunoprecipitation sequencing (RIP-seq).
In still yet an even further embodiment, the genetic material is one of: a whole genome or a partial genome.
In still yet an even further embodiment, the genetic material is obtained from a biopsy of each individual of the collection of individuals.
In still yet an even further embodiment, the identified set of variants includes at least one de novo variant.
In still yet an even further embodiment, the identified set of variants includes at least one inherited variant.
In still yet an even further embodiment, a biochemical assay is performed to further assess at least one variant of the set of variants, wherein the biochemical assay assesses one of: transcription, RNA processing, translation, or cell function.
In still yet an even further embodiment, the biochemical assay is selected from a group consisting of: chromatin immunoprecipitation sequencing (ChIP-seq), DNase I hypersensitivity sequencing (DNase-seq), Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq), Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE-seq), Hi-C capture sequencing, bisulfite sequencing (BS-seq), methyl array, transgene expression analysis, qPCR, RNA hybridization, cross-linking immunoprecipitation sequencing (CLIP-seq), RNA immunoprecipitation sequencing (RIP-seq), RNA-seq, western blot, immunodetection, flow cytometry, enzyme-linked immunosorbent assay (ELISA), and mass spectrometry.
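By way of non-limiting illustration only, the step of determining the biochemical regulatory effects of each variant with a trained model may be sketched as follows. The sketch assumes that a variant's effect on each regulatory feature is taken as the difference between the model's prediction for the sequence carrying the alternate allele and its prediction for the reference sequence; this comparison is an assumption of the sketch and is not stated in the embodiments above. The function `toy_regulatory_model` is a hypothetical stand-in that counts made-up motifs; it is not a neural network and merely occupies the place where a trained model would be called.

```python
def apply_variant(sequence, offset, ref, alt):
    """Return the sequence with the reference allele at `offset` (0-based) replaced by alt."""
    if sequence[offset:offset + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return sequence[:offset] + alt + sequence[offset + len(ref):]


def toy_regulatory_model(sequence):
    """Hypothetical stand-in for a trained neural network.

    Returns one value per regulatory feature. Each "feature" here is only a
    count of a made-up motif; a trained model would output learned predictions.
    """
    motifs = {"feature_A": "GATA", "feature_B": "TATA"}  # hypothetical features
    return {name: float(sequence.count(motif)) for name, motif in motifs.items()}


def regulatory_effects(sequence, offset, ref, alt, model=toy_regulatory_model):
    """Per-feature effect of a variant: alternate prediction minus reference prediction."""
    ref_profile = model(sequence)
    alt_profile = model(apply_variant(sequence, offset, ref, alt))
    return {name: alt_profile[name] - ref_profile[name] for name in ref_profile}


# Hypothetical sequence window and single-nucleotide variant (T to A at offset 9).
window = "CCGATACCTTTACC"
effects = regulatory_effects(window, offset=9, ref="T", alt="A")
```

The resulting dictionary plays the role of the composite of regulatory effects for one variant; the same call would be repeated for each variant of the identified set.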
In an embodiment for evaluating pathogenicity of variants, using computer systems, a linear regression model is trained to yield a pathogenicity of a variant based on the variant's effect on biochemical regulation. The pathogenicity of the variant is based upon an aggregation of the effects upon the at least one biochemical regulatory process. The linear regression model is trained utilizing a set of known pathogenic variants and a set of null variants. The effects on biochemical regulation have been determined for each variant of the set of pathogenic variants and of the set of null variants. Using the computer systems, a set of variants to determine pathogenicity is obtained. The effects on biochemical regulation have been determined for each variant of the set of variants to determine pathogenicity. Using the computer systems and the trained linear regression model, the pathogenicity of each variant of the set of variants is determined.
In another embodiment, the effects of biochemical regulation have been determined by a neural network computational model, wherein the biochemical regulatory effects are one of: effects on transcriptional regulation or effects on posttranscriptional regulation, wherein the neural network computational model is trained utilizing a set of features of a regulatory effect profile, and wherein the regulatory effect profile is one of: a chromatin regulatory effect profile and an RNA binding protein (RBP) and RNA element profile.
In yet another embodiment, the neural network is a deep convolutional neural network.
In a further embodiment, the linear regression model is L2 regularized.
In still yet another embodiment, the biochemical regulatory process is selected from a group consisting of: transcriptional regulation, posttranscriptional regulation, and translational regulation.
In yet a further embodiment, the set of known pathogenic variants is retrieved from the Human Gene Mutation Database.
In an even further embodiment, the set of null variants is derived from at least one of: the International Genome Sample Resource (IGSR) 1000 Genomes project, a set of common variants with no expected pathogenicity, a set of variants randomly generated by in silico methods.
In yet an even further embodiment, each variant of the obtained set of variants is associated with a complex trait.
In still yet an even further embodiment, the complex trait is a medical disorder.
In still yet an even further embodiment, the obtained set of variants is derived from a collection of individuals, and wherein each individual of the collection of individuals shares the complex trait.
In still yet an even further embodiment, each obtained variant's pathogenicity is aggregated to achieve a cumulative pathogenicity score for the set of obtained variants.
In still yet an even further embodiment, the obtained set of variants includes at least one de novo variant.
In still yet an even further embodiment, the obtained set of variants includes at least one inherited variant.
In still yet an even further embodiment, a biochemical assay is performed to further assess at least one variant of the set of variants, wherein the biochemical assay assesses one of: transcription, RNA processing, translation, or cell function.
In still yet an even further embodiment, the biochemical assay is selected from a group consisting of: chromatin immunoprecipitation sequencing (ChIP-seq), DNase I hypersensitivity sequencing (DNase-seq), Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq), Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE-seq), Hi-C capture sequencing, bisulfite sequencing (BS-seq), methyl array, transgene expression analysis, qPCR, RNA hybridization, cross-linking immunoprecipitation sequencing (CLIP-seq), RNA immunoprecipitation sequencing (RIP-seq), RNA-seq, western blot, immunodetection, flow cytometry, enzyme-linked immunosorbent assay (ELISA), and mass spectrometry.
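By way of non-limiting illustration only, training an L2 regularized linear regression model on known pathogenic variants and null variants, and then scoring a new variant, may be sketched as follows. The sketch assumes that each variant is represented by a vector of previously determined regulatory effects, that pathogenic variants are labeled 1 and null variants 0, and that fitting is done by plain gradient descent; the labels, the optimizer, the penalty strength `l2`, and all data values are hypothetical choices made for the sketch.

```python
def train_ridge(features, labels, l2=0.1, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by gradient descent on mean squared error plus an L2 penalty.

    The penalty l2 * sum(w ** 2) is applied to the weights only, not to the bias.
    """
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = bias + sum(w * xi for w, xi in zip(weights, x)) - y
            grad_b += 2 * error / n_samples
            for j, xi in enumerate(x):
                grad_w[j] += 2 * error * xi / n_samples
        # The term 2 * l2 * w is the gradient of the L2 penalty.
        weights = [w - learning_rate * (g + 2 * l2 * w) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def pathogenicity(effect_vector, weights, bias):
    """Pathogenicity as a weighted aggregation of a variant's regulatory effects."""
    return bias + sum(w * e for w, e in zip(weights, effect_vector))


# Hypothetical regulatory-effect vectors, for illustration only.
known_pathogenic = [[0.9, 0.8], [0.8, 0.6], [1.0, 0.7]]
null_variants = [[0.1, 0.0], [0.0, 0.2], [0.2, 0.1]]
features = known_pathogenic + null_variants
labels = [1.0] * len(known_pathogenic) + [0.0] * len(null_variants)

weights, bias = train_ridge(features, labels)
high = pathogenicity([0.9, 0.7], weights, bias)  # resembles the pathogenic set
low = pathogenicity([0.1, 0.1], weights, bias)   # resembles the null set
```

Because the model is linear, the pathogenicity of a variant is a weighted sum of its effects, which is one way of reading the statement that pathogenicity is based upon an aggregation of the effects on the biochemical regulatory process.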
In an embodiment to develop a molecular assay to detect the presence of variants in pathogenic loci, using computer systems and a computational model, the pathogenicity of each variant of a first set of variants is determined. The pathogenicity is determined by the computational model and is based upon the variant's cumulative effects on a set of biochemical regulations. The computational model is trained utilizing a set of known pathogenic variants and a set of null variants. A set of genomic loci is identified. Each genetic locus spans across at least one variant of a second set of variants. The second set of variants is at least a subset of the first set of variants.
In another embodiment, the second set of variants are selected based on their pathogenicity. A set of nucleic acid oligomers is synthesized such that the set of nucleic acid oligomers can be utilized in a molecular assay to detect the presence of variants within the set of identified genomic loci.
In yet another embodiment, the computational model is a linear regression model.
In a further embodiment, the linear regression model is L2 regularized.
In still yet another embodiment, the effects of biochemical regulation have been determined by a neural network computational model, wherein the biochemical regulatory effects are one of: effects on transcriptional regulation or effects on posttranscriptional regulation, wherein the neural network computational model is trained utilizing a set of features of a regulatory effect profile, and wherein the regulatory effect profile is one of: a chromatin regulatory effect profile and an RNA binding protein (RBP) and RNA element profile.
In yet a further embodiment, the neural network is a deep convolutional neural network.
In an even further embodiment, the biochemical regulatory process is selected from a group consisting of: transcriptional regulation, posttranscriptional regulation, and translational regulation.
In yet an even further embodiment, the set of null variants is derived from at least one of: the International Genome Sample Resource (IGSR) 1000 Genomes project, a set of common variants with no expected pathogenicity, a set of variants randomly generated by in silico methods.
In still yet an even further embodiment, each variant of the first set of variants is associated with a complex trait.
In still yet an even further embodiment, the complex trait is a medical disorder.
In still yet an even further embodiment, the obtained set of variants is derived from a collection of individuals, and wherein each individual of the collection of individuals shares the complex trait.
In still yet an even further embodiment, the second set of variants includes at least one de novo variant.
In still yet an even further embodiment, the second set of variants includes at least one inherited variant.
In still yet an even further embodiment, the pathogenicity of each variant of the second set of variants is greater than a threshold.
In still yet an even further embodiment, the molecular assay is capture sequencing and the set of nucleic acid oligomers is capable of hybridizing to the set of identified genomic loci.
In still yet an even further embodiment, the molecular assay is a single nucleotide polymorphism (SNP) array and the set of nucleic acid oligomers is capable of hybridizing to the set of identified genomic loci.
In still yet an even further embodiment, the molecular assay is a sequencing assay and the set of nucleic acid oligomers is capable of amplifying the set of identified genomic loci by polymerase chain reaction (PCR).
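By way of a non-limiting illustration, the selection of the second set of variants by a pathogenicity threshold and the identification of genomic loci spanning those variants can be sketched in Python as follows. The function names, the flank width, and the rule of merging overlapping windows are assumptions made for this sketch only and are not recited in the embodiments above.

```python
from typing import Dict, List, Tuple

VariantPos = Tuple[str, int]   # (chromosome, 1-based position)
Locus = Tuple[str, int, int]   # (chromosome, start, end), inclusive


def select_variants(pathogenicity: Dict[VariantPos, float],
                    threshold: float) -> List[VariantPos]:
    """Keep variants whose pathogenicity is greater than the threshold."""
    return sorted(v for v, score in pathogenicity.items() if score > threshold)


def loci_spanning(variants: List[VariantPos], flank: int = 60) -> List[Locus]:
    """Build one window per variant and merge windows that overlap.

    The flank width is a hypothetical design parameter (for example, a
    probe or amplicon half-length); every returned locus spans at least
    one selected variant.
    """
    loci: List[Locus] = []
    for chrom, pos in sorted(variants):
        start, end = max(1, pos - flank), pos + flank
        if loci and loci[-1][0] == chrom and start <= loci[-1][2]:
            # Overlaps the previous window on the same chromosome: extend it.
            loci[-1] = (chrom, loci[-1][1], max(loci[-1][2], end))
        else:
            loci.append((chrom, start, end))
    return loci


# Illustrative scores only; not data from the application.
example_scores = {
    ("chr1", 100): 0.9,
    ("chr1", 150): 0.8,
    ("chr1", 1000): 0.2,
    ("chr2", 500): 0.95,
}
selected = select_variants(example_scores, threshold=0.5)
target_loci = loci_spanning(selected, flank=60)
```

Oligomers for capture, array hybridization, or PCR amplification would then be designed against the returned loci; that design step is outside the scope of this sketch.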
In an embodiment, a kit to detect the presence of variants within pathogenic loci includes a set of nucleic acid oligomers to detect the presence of variants within a set of genomic loci. The set of genomic loci have been identified to have harbored a pathogenic variant. The pathogenicity of each pathogenic variant is determined by a computational model and is based upon cumulative effects on a set of biochemical regulations. The computational model is trained utilizing a set of known pathogenic variants and a set of null variants. Each locus of the set of genomic loci is selected based upon the pathogenicity of the pathogenic variant it has been identified to have harbored.
In another embodiment, the computational model is a linear regression model.
In yet another embodiment, the linear regression model is L2 regularized.
In a further embodiment, the effects of biochemical regulation have been determined by a neural network computational model, wherein the biochemical regulatory effects are one of: effects on transcriptional regulation or effects on posttranscriptional regulation, wherein the neural network computational model is trained utilizing a set of features of a regulatory effect profile, and wherein the regulatory effect profile is one of: a chromatin regulatory effect profile and an RNA binding protein (RBP) and RNA element profile.
In still yet another embodiment, the neural network is a deep convolutional neural network.
In yet a further embodiment, the biochemical regulatory process is selected from a group consisting of: transcriptional regulation, posttranscriptional regulation, and translational regulation.
In an even further embodiment, the set of known pathogenic variants is retrieved from the Human Gene Mutation Database.
In yet an even further embodiment, the set of null variants is derived from at least one of: the International Genome Sample Resource (IGSR) 1000 Genomes project, a set of common variants with no expected pathogenicity, a set of variants randomly generated by in silico methods.
In still yet an even further embodiment, each pathogenic variant is associated with a complex trait.
In still yet an even further embodiment, the complex trait is a medical disorder.
In still yet an even further embodiment, at least one pathogenic variant is a de novo variant.
In still yet an even further embodiment, at least one pathogenic variant is inherited.
In still yet an even further embodiment, the pathogenicity of each pathogenic variant is greater than a threshold.
In still yet an even further embodiment, the set of nucleic acid oligomers is capable of hybridizing to the set of genomic loci for use in a capture sequencing assay.
In still yet an even further embodiment, the set of nucleic acid oligomers is capable of hybridizing to the set of genomic loci for use in a single nucleotide polymorphism (SNP) array.
In still yet an even further embodiment, the set of nucleic acid oligomers is capable of amplifying the set of genomic loci for use in a sequencing assay.
In an embodiment to treat an individual with a medication, genetic material of an individual that includes a set of genomic loci is sequenced. Each locus of the set of genomic loci contains sequence that has been determined to harbor a pathogenic variant that affects at least one biochemical regulatory process. The effect of harboring a pathogenic variant within each genomic locus has been associated with the ability to metabolize a medication as determined by the effects of the variant on the at least one biochemical regulatory process. A set of variants that reside within the set of genomic loci sequenced is identified. A trained computational model to determine pathogenicity of each variant of the set of variants identified is obtained. The pathogenicity of each variant is based upon an aggregation of the variant's effects upon the at least one biochemical regulatory process. The computational model is trained utilizing a set of known pathogenic variants and a set of null variants. Utilizing the trained computational model, a diagnosis of the individual is determined based upon a cumulative pathogenicity score of the individual. The diagnosis indicates an ability to metabolize the medication. The cumulative pathogenicity score is determined by aggregating pathogenicity of the individual's variants within the set of genomic loci. When the individual is determined to have a diagnosis indicating a reduced ability to metabolize the medication, a lower dose of the medication or an alternative medication is administered.
In another embodiment, the medication is selected from the group consisting of: abacavir, acenocoumarol, allopurinol, amitriptyline, aripiprazole, atazanavir, atomoxetine, azathioprine, capecitabine, carbamazepine, carvedilol, cisplatin, citalopram, clomipramine, clopidogrel, clozapine, codeine, daunorubicin, desflurane, desipramine, doxepin, duloxetine, enflurane, escitalopram, esomeprazole, flecainide, fluorouracil, flupenthixol, fluvoxamine, glibenclamide, gliclazide, glimepiride, haloperidol, halothane, imipramine, irinotecan, isoflurane, ivacaftor, lansoprazole, mercaptopurine, methoxyflurane, metoprolol, mirtazapine, moclobemide, nortriptyline, olanzapine, omeprazole, ondansetron, oxcarbazepine, oxycodone, pantoprazole, paroxetine, peginterferon alpha-2a, peginterferon alpha-2b, phenprocoumon, phenytoin, propafenone, rabeprazole, rasburicase, ribavirin, risperidone, sertraline, sevoflurane, simvastatin, succinylcholine, tacrolimus, tamoxifen, tegafur, thioguanine, tolbutamide, tramadol, trimipramine, tropisetron, venlafaxine, voriconazole, warfarin, and zuclopenthixol.
In yet another embodiment, the medication is risperidone. Low biochemical activity of the gene CYP2D6 indicates the reduced ability to metabolize risperidone.
The description and claims will be more fully understood with reference to the following figures and data graphs, which are presented as exemplary embodiments of the invention and should not be construed as a complete recitation of the scope of the invention.
Turning now to the drawings and data, a number of processes for genetic data extrapolation that can be utilized in diagnostics, medicament development, and/or treatments in accordance with various embodiments of the invention are illustrated. Numerous embodiments are directed towards a general framework and methods for scoring the functional impact of variants from genetic data. In several embodiments, methods are utilized to determine biochemical regulatory effects of genetic variants in various regions of a genome, including noncoding regions. In various embodiments, methods further use biochemical regulatory effect scores to infer variant pathogenicity scores. In some embodiments, the trait to be examined is a medical disorder and thus a trait pathogenicity score infers diagnostic and medical information. In some embodiments, methods utilize an individual's genetic information to determine biochemical impact of genetic variants of an individual's genome in order to diagnose the individual. And in some embodiments, an individual can be treated based on her diagnosis.
Great progress has been made in the past decade in understanding the genetics of complex traits (e.g., autism spectrum disorder (ASD), bipolar disorder, coronary artery disease, diabetes, stroke, and schizophrenia), establishing particular variants, including copy number variants (CNVs) and single nucleotide variants (SNVs) that likely disrupt protein-coding genes, as causal in the development of a complex trait. In the particular case of ASD, however, all known ASD-associated genes together explain a small fraction of new cases, and it is estimated that overall de novo protein coding mutations, including CNVs, contribute to no more than 30% of simplex ASD cases (i.e., single affected ASD individual in a family). It has been found that the vast majority of identified de novo variants are not within the coding region, but are instead located within intronic and intergenic regions. Despite their prevalence, very little is known regarding the contribution of intronic and intergenic variants to the genetic architecture of ASD and other complex traits. Mutations in coding sequences of genes are interpretable because the genetic code translates DNA mutations into changes in the protein sequence that yield predictable effects on the protein.
It has been suggested that no significant noncoding proband-specific signal was observed in the complex trait of ASD, and that any approach would require a very large cohort to detect signal. Accordingly, the challenge is to move beyond simple mutation counts, which are susceptible to both statistical power challenges and confounding factors, such as the rise in mutation counts with parental age. This difficulty is shared in other complex traits, including various psychiatric diseases, such as (for example) intellectual disabilities and schizophrenia. In fact, little is known about the contribution of noncoding rare variants or de novo mutations to human diseases beyond the less common cases with Mendelian inheritance patterns.
Herein, a potential role for variants, including noncoding variants, has been found in complex disorders, as detailed in various examples described. In fact, variants are likely to be causal in development of complex human traits. It has been found that variants within genetic regulatory regions lead to deleterious effects. Furthermore, variants can impact transcriptional and/or post-transcriptional biochemical function, resulting in causation of complex human traits. Furthermore, mutations within noncoding regions are hard to interpret because there is no “code” like the amino acid codon code, which provides an ability to predict biological effects when a mutation lies within a coding region.
A number of method embodiments have been developed to overcome the problems associated with the difficulty of identifying impactful variants of complex traits. Several of these embodiments enable comparison of variant burden between affected and unaffected individuals not simply in terms of number of variants, but in terms of their biochemical impact and overall pathogenicity (i.e., disease impact). Specifically, in some embodiments, biochemical data demarcating DNA and RNA binding protein interactions were used to train and deploy a deep convolutional-neural-network-based framework that predicts the functional impact and pathogenicity of variants, with independent models trained for DNA and RNA. This framework, in accordance with various embodiments, can estimate, with single nucleotide resolution, the quantitative impact of each variant on transcriptional and post-transcriptional regulatory features, including histone marks, transcription factors and RNA-binding protein (RBP) profiles.
Furthermore, various embodiments are directed to examining variants using a computational model to determine transcriptional and/or posttranscriptional regulatory effect of variants. Computational models, in accordance with a number of embodiments, are also used to determine a trait pathogenicity score based on cumulative transcriptional and/or posttranscriptional regulatory effect of variants. In some embodiments, an individual's genome is entered into the computational models to predict a likelihood of trait manifestation, including manifestation of medical disorders. And in several embodiments, diagnostics and/or treatments are performed based upon a likelihood of complex disease manifestation. In some embodiments, a threshold is used to diagnose and determine treatment options.
A number of embodiments are also directed to utilizing an individual's sequencing data and examining various loci known to be involved with pathogenic transcriptional and/or posttranscriptional regulatory effects associated with a trait. By examining specific loci, many embodiments determine an individual's cumulative variant pathogenicity. In some embodiments, when a trait to be examined is a medical disorder, an individual is diagnosed and treated based upon the individual's cumulative variant pathogenicity.
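By way of a non-limiting illustration, restricting an individual's scored variants to a set of loci and comparing the resulting cumulative variant pathogenicity against a threshold can be sketched in Python as follows. Summation as the aggregation rule, the threshold value, and all names are assumptions of this sketch; other aggregation rules and thresholds could be used as appropriate to a given application.

```python
from typing import Dict, List, Tuple

VariantPos = Tuple[str, int]   # (chromosome, 1-based position)
Locus = Tuple[str, int, int]   # (chromosome, start, end), inclusive


def within_loci(variant: VariantPos, loci: List[Locus]) -> bool:
    """True when the variant falls inside any of the examined loci."""
    chrom, pos = variant
    return any(c == chrom and start <= pos <= end for c, start, end in loci)


def cumulative_variant_pathogenicity(scores: Dict[VariantPos, float],
                                     loci: List[Locus]) -> float:
    """Sum per-variant pathogenicity over variants inside the loci (assumed rule)."""
    return sum(score for variant, score in scores.items()
               if within_loci(variant, loci))


def exceeds_threshold(cumulative: float, threshold: float) -> bool:
    """Flag an individual whose cumulative score is above the threshold."""
    return cumulative > threshold


# Illustrative values only; not data from the application.
examined_loci = [("chr1", 40, 210), ("chr2", 440, 560)]
individual_scores = {
    ("chr1", 50): 0.4,
    ("chr1", 200): 0.3,
    ("chr1", 900): 0.9,   # outside the examined loci, so ignored
    ("chr2", 500): 0.2,
}
total = cumulative_variant_pathogenicity(individual_scores, examined_loci)
flagged = exceeds_threshold(total, threshold=0.75)
```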
A conceptual illustration of a process to determine pathogenicity of variants related to a particular trait in accordance with an embodiment of the invention is illustrated in
Process 100, in accordance with a number of embodiments, begins with obtaining (101) genetic data from a collection of individuals sharing a complex trait and from a collection of unaffected individuals. In some embodiments, the individuals sharing a complex trait are probands in a simplex family. It is to be understood that a simplex family is a family with a single affected child having a complex trait and the parents and any siblings are unaffected. It should be further understood that a proband refers to the affected child, which is likely to have a set of de novo variants that in the aggregate give rise to the trait. Furthermore, it is to be understood that the aggregate of variants within the unaffected family members is unlikely to give rise to the trait.
In accordance with various embodiments, genetic data can be derived from a number of sources. In some instances, these genetic data are obtained de novo by extracting the DNA from a biological source and sequencing it. Alternatively, genetic sequence data can be obtained from publicly or privately available databases. Many databases exist that store datasets of sequences from which a user can extract the data to perform experiments upon, such as the Simons Simplex Collection. In many embodiments, the genetic sequence data include whole or partial genomes that include noncoding DNA to be examined; accordingly, any genetic data set as appropriate to the requirements of a given application could be used.
As shown in
The number of individuals within a collection can depend on the application and trait to be examined. It should be noted that increasing the number of individuals in a collection can improve machine learning and variant aggregation models. Accordingly, in a number of embodiments, collections should include at least several hundred individuals.
Once genetic data are obtained, process 100 can then identify (103) a set of variants that alter biochemical regulation in the collection of individuals sharing a trait. In many embodiments, a variant is a single nucleotide variant (SNV), a copy number variant (CNV), an insertion, or a deletion. Accordingly, a profile of variants that exist all along the genetic data set can be determined for each collection of individuals.
In some embodiments, utilizing unaffected family members of simplex families, de novo variants can be determined for probands and unaffected siblings, and the two sets can then be compared. In several embodiments, de novo noncoding variants are examined for their effect on biochemical regulation (e.g., transcriptional and/or posttranscriptional regulation). Accordingly, the biochemical effects of noncoding variants of probands can be differentiated from the biochemical effects of noncoding variants of unaffected family members.
In some embodiments, a computational model is trained utilizing biochemical effect variant profiles such that the model can be used to predict the biochemical effect of variants of affected and unaffected individuals. Biochemical effect variant profile datasets can include (but are not limited to) genome-wide chromatin and RNA-binding profiles. These data sets can yield genomic loci that are important in regulating transcription and/or posttranscriptional processing.
Process 100 determines (105) trait pathogenicity of variants based on variants that alter biochemical regulation. In some embodiments, the pathogenicity of each variant from a collection of individuals is determined. In some embodiments, variant pathogenicity is aggregated to yield a pathogenicity score for a particular trait. In a number of embodiments, a computational model is utilized to determine the pathogenicity of variants, which can be trained using a set of pathogenic regulatory variants and a set of null variants.
In several embodiments, processes to determine trait pathogenicity of variants are utilized in various downstream applications, including (but not limited to) diagnosis of an individual, treatment of an individual, and/or development of diagnostic assays. These embodiments are described in greater detail in subsequent sections.
A conceptual illustration of a process to determine transcriptional and/or posttranscriptional regulatory effects of variants utilizing computing systems is provided in
Methods to generate chromatin and RBP/RNA-element profiles are well known in the art. Generally, chromatin profiles can be determined utilizing various epigenetic assays including (but not limited to) chromatin immunoprecipitation sequencing (ChIP-seq), DNase I hypersensitivity sequencing (DNase-seq), Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq), Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE-seq), Hi-C capture sequencing, bisulfite sequencing (BS-seq), and methyl array. RBP/RNA-element profiles can be determined utilizing various RNA-binding assays, including (but not limited to) cross-linking immunoprecipitation sequencing (CLIP-seq) and RNA immunoprecipitation sequencing (RIP-seq). Several databases store chromatin and RBP/RNA-element profiles which can be used, including (but not limited to) Encyclopedia of DNA Elements (ENCODE) (https://www.encodeproject.org/), NIH Roadmap Epigenomics Mapping Consortium (http://www.roadmapepigenomics.org/), and the International Human Epigenome Consortium (IHEC) (https://epigenomesportal.ca/ihec/).
Utilizing chromatin and/or RBP/RNA-element regulatory effects profiles, a computational model is trained (203) to yield a composite transcriptional and/or posttranscriptional regulatory effect model with a number of features. In several embodiments, the computational model is a deep neural network. In some embodiments, the computational model is a convolutional neural network.
Process 200 also obtains (205) genetic data from a collection of individuals having a complex trait and from a collection of unaffected individuals. The particular trait to be examined depends on the task at hand. For example, if process 200 is used to determine regulatory effects of variants of a particular medical disorder, each individual having the trait should be diagnosed with the disorder and each unaffected individual should not have manifested the disorder.
The number of individuals within a collection can depend on the application and trait to be examined. It should be noted that increasing the number of individuals in a collection can improve machine learning and variant aggregation models. Accordingly, in a number of embodiments, collections should include at least several hundred individuals.
In many embodiments, genetic data to be obtained can be any sequence data that contain genetic variants, especially variants within noncoding regions. In several embodiments, genetic data are whole or partial genomes inclusive of noncoding regions. In some embodiments, sequencing data is directed to cover various regulatory regions important for the trait to be examined.
In accordance with various embodiments of the invention, genetic data can be derived from a number of sources. In some embodiments, these sources include sequences derived from DNA of a biological source that are subsequently processed and sequenced. In some embodiments, sequences are obtained from a publicly or privately available database. Many databases exist that store datasets of sequences from which a user can extract the data to perform experiments upon.
In many embodiments, biological samples of DNA can be used for sequencing that are each derived from a biopsy of an individual. In particular embodiments, the DNA to be acquired can be derived from biopsies of human patients associated with a phenotype or a disease state and derived from unaffected individuals as well. In some embodiments, DNA can be derived from common research sources, such as in vitro tissue culture cell lines or research mouse models. In many embodiments involving sample extraction, DNA molecules are extracted, processed and sequenced according to methods commonly understood in the field.
In accordance with various embodiments, genetic data are processed (207) to generate variant data for a collection of individuals. In many embodiments, variant profiles are further analyzed and trimmed, often dependent on the application. In some embodiments, variant calls within repeat regions are removed. In some embodiments, indels are removed. In some embodiments, only variants of a particular frequency (e.g., rare variants with MAF≤1.0%) are examined and thus all other variants are excluded. In some embodiments, known and/or pre-classified variants from various known databases are removed. For example, when examining variants related to a disorder, it may be ideal to remove known variants that exist in databases of healthy individuals, as it may be reasonable to presume that these variants are not related to a disordered state.
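By way of a non-limiting illustration, the trimming steps described above can be sketched in Python as follows. The `Variant` record, the function names, and the representation of repeat regions and known variants are hypothetical and chosen for this sketch only; in practice these inputs would come from variant-call files and annotation resources appropriate to the application.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

Interval = Tuple[str, int, int]            # (chromosome, start, end), inclusive
VariantKey = Tuple[str, int, str, str]     # (chromosome, position, ref, alt)


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int      # 1-based position
    ref: str
    alt: str
    maf: float    # minor allele frequency as a fraction (0.01 is 1.0%)


def is_indel(v: Variant) -> bool:
    """Treat any variant whose alleles differ in length as an indel."""
    return len(v.ref) != len(v.alt)


def in_any_interval(v: Variant, intervals: Iterable[Interval]) -> bool:
    return any(c == v.chrom and s <= v.pos <= e for c, s, e in intervals)


def trim_variants(variants: Iterable[Variant],
                  repeat_regions: List[Interval],
                  known_variants: Set[VariantKey],
                  max_maf: float = 0.01,
                  drop_indels: bool = True) -> List[Variant]:
    """Apply the optional filters in turn and return the surviving variants."""
    kept = []
    for v in variants:
        if in_any_interval(v, repeat_regions):
            continue                      # call lies within a repeat region
        if drop_indels and is_indel(v):
            continue                      # indels removed in some embodiments
        if v.maf > max_maf:
            continue                      # keep only rare variants (MAF <= 1.0%)
        if (v.chrom, v.pos, v.ref, v.alt) in known_variants:
            continue                      # already catalogued in a reference set
        kept.append(v)
    return kept


# Illustrative records only; not data from the application.
calls = [
    Variant("chr1", 100, "A", "G", 0.001),
    Variant("chr1", 150, "A", "AT", 0.001),   # indel
    Variant("chr1", 500, "C", "T", 0.2),      # common
    Variant("chr1", 1050, "G", "A", 0.0),     # inside a repeat region
    Variant("chr2", 10, "T", "C", 0.005),     # known variant
]
trimmed = trim_variants(calls, [("chr1", 1000, 1100)], {("chr2", 10, "T", "C")})
```

Each filter is independent, so an embodiment that retains indels or uses a different frequency cutoff only changes the corresponding argument.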
In some embodiments, variant profiles are trimmed to specifically only keep de novo variants (i.e., variants that are not within parental genomes and thus arose in gametes and/or early in development). Many methods are known within the art to trim variant profiles to only de novo variants. In some embodiments, the GATK pipeline is used to trim variants (https://software.broadinstitute.org/gatk/). Accordingly, de novo noncoding variant profiles can be created for various collections of individuals. In some embodiments, a de novo noncoding variant profile is generated for a collection of probands. In some embodiments, a de novo noncoding variant profile is generated for a collection of unaffected individuals. In some embodiments, a classifier can be used to score each candidate de novo noncoding variant to obtain a comparable number of high-confidence de novo noncoding variant calls. In some embodiments, the classifier DNMFilter (https://github.com/yongzhuang/DNMFilter) is used to score candidate de novo noncoding variants, utilizing an appropriate threshold of probability (e.g., >0.75; or e.g., >0.5) as determined for each experimental set of variant collections.
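By way of a non-limiting illustration, the logic of retaining only de novo candidates and then applying a probability threshold can be sketched in Python as follows. This sketch does not call GATK or DNMFilter; the set-difference rule and the function names are assumptions, and the classifier probabilities are supplied as a plain dictionary that stands in for the output of whichever classifier is used.

```python
from typing import Dict, List, Set, Tuple

VariantKey = Tuple[str, int, str, str]   # (chromosome, position, ref, alt)


def candidate_de_novo(child: Set[VariantKey],
                      mother: Set[VariantKey],
                      father: Set[VariantKey]) -> Set[VariantKey]:
    """Variants called in the child but in neither parent (simplified rule)."""
    return child - mother - father


def high_confidence(candidates: Set[VariantKey],
                    probabilities: Dict[VariantKey, float],
                    threshold: float = 0.75) -> List[VariantKey]:
    """Keep candidates whose classifier probability is above the threshold.

    Candidates without a score are treated as probability 0.0, which is an
    assumption of this sketch.
    """
    return sorted(v for v in candidates
                  if probabilities.get(v, 0.0) > threshold)


# Illustrative calls only; not data from the application.
child_calls = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
mother_calls = {("chr1", 200, "C", "T")}
father_calls: Set[VariantKey] = set()

candidates = candidate_de_novo(child_calls, mother_calls, father_calls)
classifier_probabilities = {("chr1", 100, "A", "G"): 0.9, ("chr2", 50, "G", "A"): 0.6}
strict_calls = high_confidence(candidates, classifier_probabilities, threshold=0.75)
lenient_calls = high_confidence(candidates, classifier_probabilities, threshold=0.5)
```

Applying the two example thresholds to the same candidates shows how the threshold can be tuned per experimental set to obtain a comparable number of calls across collections.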
Process 200 also utilizes variants of a collection of individuals and the trained model of step 203 to determine (209) transcriptional and/or posttranscriptional regulatory effects of the variants. Accordingly, variants that affect transcriptional and/or posttranscriptional regulation are likely causal in complex trait manifestation.
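By way of a non-limiting illustration, the determination (209) can be sketched in Python as scoring the reference sequence and the variant-bearing sequence with the trained model and reporting the per-feature difference. Taking the difference between the two predictions is an assumption of this sketch rather than a step recited above, and `one_hot`, `apply_snv`, `regulatory_effect`, and `toy_model` are hypothetical names; `toy_model` merely stands in for the trained model of step 203 so that the example is runnable.

```python
from typing import Callable, List

BASES = "ACGT"
Encoded = List[List[float]]
Model = Callable[[Encoded], List[float]]


def one_hot(seq: str) -> Encoded:
    """Encode each base as a 4-element indicator row (unknown bases are all zeros)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def apply_snv(seq: str, offset: int, ref: str, alt: str) -> str:
    """Return the sequence carrying the alternate allele at a 0-based offset."""
    if seq[offset] != ref:
        raise ValueError("reference allele does not match the sequence")
    return seq[:offset] + alt + seq[offset + 1:]


def regulatory_effect(model: Model, seq: str, offset: int,
                      ref: str, alt: str) -> List[float]:
    """Per-feature change in model output caused by the variant (alt minus ref)."""
    ref_pred = model(one_hot(seq))
    alt_pred = model(one_hot(apply_snv(seq, offset, ref, alt)))
    return [a - r for a, r in zip(alt_pred, ref_pred)]


def toy_model(x: Encoded) -> List[float]:
    """Stand-in for a trained network: returns [GC fraction, A fraction]."""
    n = len(x)
    gc_fraction = sum(row[1] + row[2] for row in x) / n
    a_fraction = sum(row[0] for row in x) / n
    return [gc_fraction, a_fraction]


# An A-to-G change at the first base of an 8-base window.
effect = regulatory_effect(toy_model, "ACGTACGT", 0, "A", "G")
```

With a trained convolutional network in place of `toy_model`, each element of the returned list would correspond to one regulatory feature of the model.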
In accordance with several embodiments, variant profiles of collections of individuals, their regulatory effects, and the computational model are stored and/or reported (211). In some embodiments, these profiles and regulatory effects may be used in many further downstream applications, including (but not limited to) identifying regions of regulation that are often affected in a complex trait and determining variant pathogenicity.
While a specific example of a process for determining transcriptional and/or posttranscriptional regulatory effects of variants is described above, one of ordinary skill in the art can appreciate that various steps of the process can be performed in different orders and that certain steps may be optional according to some embodiments of the invention. As such, it should be clear that the various steps of the process could be used as appropriate to the requirements of specific applications.
Depicted in
Process 300 can begin with obtaining (301) a set of pathogenic regulatory variants and a set of null variants (i.e., variants not determined to be pathogenic regulatory variants). In some embodiments, pathogenic regulatory variants are retrieved from an appropriate database, such as (for example) the Human Gene Mutation Database. Pathogenic regulatory variants should be variants annotated as “regulatory” and known to be involved in pathogenesis of a trait (e.g., medical disorder). In a number of embodiments, null variants are any variants that are not involved in pathogenesis of the trait. In some instances, null variants are retrieved from healthy individuals such as (for example) data of the International Genome Sample Resource (IGSR) 1000 Genomes project (http://www.internationalgenome.org/). In some instances, null variants are common variants with no expected pathogenicity. In some instances, null variants are generated randomly by in silico methods.
In several embodiments, a set of pathogenic regulatory variants and a set of null variants each have determined biochemical effects. In some embodiments, biochemical effects include transcriptional and/or posttranscriptional effects. In some embodiments, transcriptional and/or posttranscriptional effects are determined as described in
A set of pathogenic regulatory variants and a set of null variants are used to train (303) a computational model to be able to determine pathogenicity of variants based on the variant's aggregated biochemical effects. In several embodiments, a pathogenicity computational model is trained to delineate which biochemical effects are associated with pathogenic variants as opposed to null variants. In many embodiments, a linear regression model is used. In some instances, a linear regression model is L2 regularized and trained using an appropriate package, such as (for example) the xgboost package (https://github.com/dmlc/xgboost). In some embodiments, predicted probabilities are z-transformed to have a particular mean and standard deviation.
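By way of a non-limiting illustration, training an L2 regularized linear model on biochemical effect features of pathogenic and null variants, followed by a z-transformation of the predicted scores, can be sketched in Python as follows. This sketch deliberately does not use the xgboost package; plain gradient descent is substituted so that the example is self-contained. The feature values, hyperparameters, labels of 1.0 and 0.0, and function names are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, row: Sequence[float]) -> float:
    return sum(w * x for w, x in zip(weights, row)) + bias


def train_l2_linear(features: Sequence[Sequence[float]],
                    labels: Sequence[float],
                    l2: float = 0.01,
                    learning_rate: float = 0.1,
                    epochs: int = 2000) -> Tuple[List[float], float]:
    """Minimize mean squared error + l2 * ||w||^2 by batch gradient descent."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = [predict(weights, bias, row) - y
                  for row, y in zip(features, labels)]
        for j in range(d):
            grad = 2.0 / n * sum(e * row[j] for e, row in zip(errors, features))
            weights[j] -= learning_rate * (grad + 2.0 * l2 * weights[j])
        bias -= learning_rate * 2.0 / n * sum(errors)   # bias is not penalized
    return weights, bias


def z_transform(scores: Sequence[float],
                target_mean: float = 0.0,
                target_sd: float = 1.0) -> List[float]:
    """Rescale scores to the requested mean and standard deviation."""
    mu, sd = mean(scores), pstdev(scores)
    if sd == 0:
        raise ValueError("scores have zero variance")
    return [target_mean + target_sd * (s - mu) / sd for s in scores]


# Illustrative effect features only; not data from the application.
pathogenic_features = [[0.9, 0.8], [0.8, 0.7], [0.7, 0.9]]
null_features = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.1]]
training_rows = pathogenic_features + null_features
training_labels = [1.0] * 3 + [0.0] * 3

weights, bias = train_l2_linear(training_rows, training_labels)
raw_scores = [predict(weights, bias, row) for row in training_rows]
z_scores = z_transform(raw_scores)
```

The L2 penalty shrinks the weights toward zero, which can help when many correlated regulatory features are supplied to the model.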
Process 300 also obtains (305) a set of regulatory variants associated with a trait, each variant having a determined biochemical effect. A set of regulatory variants can be any set to be examined. In some instances, a set of regulatory variants are associated with a particular medical disorder. In some instances, a set of regulatory variants are associated with ASD. In some instances, a set of regulatory variants and their biochemical effects are determined in accordance with Process 200 described herein. In some instances, a set of regulatory variants are associated with traits shared by a collection of individuals. In some instances, a set of regulatory variants are associated with unaffected individuals, which can be useful for comparing pathogenicity of variants associated with a trait.
Utilizing the trained computational model of Step 303, the pathogenicity of each variant of a set of regulatory variants is determined (307) based upon each variant's aggregated biochemical effect. In some embodiments, a cumulative pathogenicity score for each trait is determined. In some embodiments, a cumulative pathogenicity score for a set of variants is determined by various statistical methods, which may include an aggregate score. In some embodiments, a pathogenicity score is compared between a set of trait associated variants and a set of null variants.
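By way of a non-limiting illustration, computing a cumulative pathogenicity score and comparing trait-associated variants against null variants can be sketched in Python as follows. The choice of sum or mean as the aggregate, the use of a one-sided permutation test for the comparison, and the function names are assumptions of this sketch; the description above leaves the statistical method open.

```python
import random
from statistics import mean
from typing import Sequence


def cumulative_score(scores: Sequence[float], method: str = "sum") -> float:
    """Aggregate per-variant pathogenicity into a single score."""
    if method == "sum":
        return sum(scores)
    if method == "mean":
        return mean(scores)
    raise ValueError("method must be 'sum' or 'mean'")


def permutation_p_value(trait_scores: Sequence[float],
                        null_scores: Sequence[float],
                        n_permutations: int = 1000,
                        seed: int = 0) -> float:
    """One-sided test that trait variants score higher on average than null variants."""
    rng = random.Random(seed)
    observed = mean(trait_scores) - mean(null_scores)
    pooled = list(trait_scores) + list(null_scores)
    k = len(trait_scores)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Illustrative scores only; not data from the application.
trait_variant_scores = [0.8, 0.6, 0.9, 0.7]
null_variant_scores = [0.2, 0.1, 0.3, 0.4]

trait_total = cumulative_score(trait_variant_scores, "sum")
mean_difference = (cumulative_score(trait_variant_scores, "mean")
                   - cumulative_score(null_variant_scores, "mean"))
p_value = permutation_p_value(trait_variant_scores, null_variant_scores)
```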
Pathogenicity scores of a set of regulatory variants and a trained computational model are stored and/or reported (309). In a number of embodiments, pathogenicity scores of a set of regulatory variants are used in a number of downstream applications, including (but not limited to) clinical classification of individuals (e.g., clinical diagnostics), further molecular research into the trait, and identification of functionality and tissue specificity. In many embodiments, a trained classification model is used to classify individuals in regards to a trait.
While a specific example of a process for determining pathogenicity scores of regulatory variants is described above, one of ordinary skill in the art can appreciate that various steps of the process can be performed in different orders and that certain steps may be optional according to some embodiments of the invention. As such, it should be clear that the various steps of the process could be used as appropriate to the requirements of specific applications.
As shown in
In accordance with various embodiments, an individual's genetic sequence data are processed (403) to identify variants. In many embodiments, an individual's variant profile is further analyzed and trimmed, often dependent on the application. In some embodiments, variant calls within repeat regions are removed. In some embodiments, indels are removed. In some embodiments, only variants of a particular frequency (e.g., rare variants with MAF≤1.0%) are examined and thus all other variants are excluded. In some embodiments, known and/or pre-classified variants from various known databases are removed. For example, when examining variants related to a disorder, it may be ideal to remove known variants that exist in databases of healthy individuals, as it may be reasonable to presume that these variants are not related to a disordered state.
In some embodiments, variant profiles of an individual are trimmed to specifically only keep de novo variants (i.e., variants that are not within parental genomes and thus arose in gametes and/or early in development). Many methods are known within the art to trim variant profiles to only de novo variants. In some embodiments, the GATK pipeline is used to trim variants (https://software.broadinstitute.org/gatk/). In some embodiments, a classifier can be used to score each candidate de novo variant to obtain a comparable number of high-confidence de novo variant calls. In some embodiments, the classifier DNMFilter (https://github.com/yongzhuang/DNMFilter) is used to score candidate de novo variants, utilizing an appropriate threshold of probability (e.g., >0.75; or e.g., >0.5) as determined for each experimental set of variant collections.
In some embodiments, a variant profile is generated for an individual with no medical diagnosis. In some embodiments, a variant profile is generated for an individual that has received a preliminary diagnosis.
A trained computational model capable of determining transcriptional and/or posttranscriptional regulatory effects of variants is also obtained (405). In some embodiments, a trained classification model is trained as shown and described in
The transcriptional and/or posttranscriptional regulatory effects of an individual's variants are reported and/or stored (409). In numerous embodiments, the transcriptional and/or posttranscriptional regulatory effects can be used in a number of downstream applications, which may include (but is not limited to) determining pathogenicity of the regulatory variants, which may be used for diagnosis of individuals and determination of medical intervention.
While a specific example of a process for determining the transcriptional and/or posttranscriptional regulatory effects of an individual's variants is described above, one of ordinary skill in the art can appreciate that various steps of the process can be performed in different orders and that certain steps may be optional according to some embodiments of the invention. As such, it should be clear that the various steps of the process could be used as appropriate to the requirements of specific applications.
As shown in
In several embodiments, a set of variants to be examined has biochemical effects that have been determined. In some embodiments, biochemical effects include transcriptional and/or posttranscriptional effects. In some embodiments, transcriptional and/or posttranscriptional effects are determined as described in
A trained computational model capable of determining pathogenicity of a set of regulatory variants based on each variant's biochemical effect is also obtained (405). In some embodiments, a trained classification model is trained as shown and described in
Trait pathogenicity scores and diagnoses of an individual are stored and/or reported (427). In a number of embodiments, pathogenicity scores of a set of regulatory variants are used in a number of downstream applications, including (but not limited to) diagnoses and treatments of patients.
While a specific example of a process for classifying individuals is described above, one of ordinary skill in the art can appreciate that various steps of the process can be performed in different orders and that certain steps may be optional according to some embodiments of the invention. As such, it should be clear that the various steps of the process could be used as appropriate to the requirements of specific applications.
As shown in
Genomic loci known to harbor pathogenic variants that affect transcriptional and/or posttranscriptional regulation can be identified by any appropriate method. In some instances, genomic loci are identified experimentally. In some instances, genomic loci are identified utilizing a computational model trained to determine transcriptional and/or posttranscriptional regulatory effects and/or pathogenicity of variants, such as (for example) the method portrayed in
Process 500 identifies (503) variants within the genomic loci sequenced. It should be understood that the variants identified can be any variants within the loci and do not have to be at the same positions as previously identified pathogenic variants. In some embodiments, some of the variants are de novo (i.e., not inherited from a parental genome). In some embodiments, at least some of the variants are inherited from a parental genome. In several embodiments, the pathogenicity of some of the variants identified is unknown.
Process 500 also determines (505) cumulative pathogenicity of an individual's variants across genomic loci sequenced. Pathogenicity of variants within genomic loci examined can be scored by an appropriate method. In some embodiments, pathogenicity of each variant is scored utilizing a trained computational model such as (for example) the model described in
An individual is diagnosed (507) in regard to a particular trait based upon the cumulative pathogenicity of the individual's variants across genomic loci examined. In some embodiments, when the cumulative pathogenicity is above a certain threshold, a diagnosis of having a particular medical disorder can be made. Conversely, in some embodiments, when the cumulative pathogenicity is below a certain threshold, an individual is diagnosed as lacking a particular medical disorder. In some instances, a medical disorder is a spectrum and thus diagnoses can be made along the spectrum based on windows of pathogenicity scores. Based on an individual's diagnosis, the individual is treated (509). Treatment will depend on the medical disorder being diagnosed.
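For illustration, the cumulative scoring and threshold-based diagnosis described above can be sketched as follows. The aggregation rule (a plain sum of per-variant scores across loci), the threshold, the window boundaries, and the labels are all assumptions chosen for the example; the per-variant scores are presumed to come from a separately trained pathogenicity model.

```python
from bisect import bisect_right
from typing import Mapping, Sequence


def cumulative_pathogenicity(scores_by_locus: Mapping[str, Sequence[float]]) -> float:
    """Sum per-variant pathogenicity scores over all examined loci.

    Summation is an assumed aggregation rule, used only for illustration.
    """
    return sum(sum(scores) for scores in scores_by_locus.values())


def diagnose(cumulative: float, threshold: float) -> bool:
    """True when the cumulative score is above the threshold."""
    return cumulative > threshold


def spectrum_label(
    cumulative: float,
    boundaries: Sequence[float],
    labels: Sequence[str],
) -> str:
    """Map a cumulative score onto a window of a spectrum.

    `boundaries` must be sorted ascending; n boundaries define n + 1 windows.
    """
    if len(labels) != len(boundaries) + 1:
        raise ValueError("need exactly one more label than boundaries")
    return labels[bisect_right(boundaries, cumulative)]


# Hypothetical per-variant scores grouped by hypothetical locus names.
scores = {"locus_A": [0.5, 0.25], "locus_B": [0.75]}
total = cumulative_pathogenicity(scores)

print(total)
print(diagnose(total, threshold=1.0))
print(spectrum_label(total, [1.0, 2.0], ["low", "intermediate", "high"]))
```

A binary threshold and a windowed spectrum are shown side by side because the text contemplates both a present/absent diagnosis and diagnoses along a spectrum.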
While a specific example of a process for diagnosing and treating individuals is described above, one of ordinary skill in the art can appreciate that various steps of the process can be performed in different orders and that certain steps may be optional according to some embodiments of the invention. As such, it should be clear that the various steps of the process could be used as appropriate to the requirements of specific applications.
Turning now to
In a number of embodiments of the invention, the memory (607) may contain a regulatory effect model application (609) and a pathogenicity model application (611) that perform all or a portion of various methods according to different embodiments of the invention described throughout the present application. As an example, processor (603) may perform trait-related variant analysis methods similar to any of the processes described above with reference to
In some embodiments of the invention, computer systems (601) may include an input/output interface (605) that can be utilized to communicate with a variety of devices, including but not limited to other computing systems, a projector, and/or other display devices. As can be readily appreciated, a variety of software architectures can be utilized to implement a computer system as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
Although computer systems and processes for variant analyses and performing actions based thereon are described above with respect to
A number of embodiments are directed towards biochemical assays to be performed based on the results of variants identified to affect transcriptional and/or posttranscriptional regulation and/or the results of a variant's pathogenicity. Accordingly, in several embodiments, methods are performed to determine transcriptional and/or posttranscriptional regulatory effects of variants and/or their pathogenicity, and based on those determinations a biochemical assay is performed to assess transcriptional and/or posttranscriptional regulation. In some embodiments, determination of transcriptional and/or posttranscriptional regulatory effects of variants and/or their pathogenicity by performing methods described in
In many embodiments, biochemical methods are performed as follows:
A number of biochemical assays can be performed on the basis of the determination of a variant's transcriptional and/or posttranscriptional regulatory effect and/or pathogenicity. Generally, biochemical assays will provide a more in-depth assessment of a variant and how it affects various biological functions, which include effects on chromatin formation, chromatin binding, nearby gene transcription, binding of RNA binding proteins, RNA stability, RNA processing, translation, cellular function, and disorder pathology. A number of biochemical assays are known in the art to assess variant effect, including (but not limited to) chromatin immunoprecipitation sequencing (ChIP-seq), DNase I hypersensitivity sequencing (DNase-seq), Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq), Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE-seq), Hi-C capture sequencing, bisulfite sequencing (BS-seq), methyl array, transgene expression analysis (e.g., luciferase and eGFP), qPCR, RNA hybridization (e.g., ISH), cross-linking immunoprecipitation sequencing (CLIP-seq), RNA immunoprecipitation sequencing (RIP-seq), RNA-seq, western blot, immunodetection, flow cytometry, enzyme-linked immunosorbent assay (ELISA), and mass spectrometry.
Several embodiments are also directed towards manipulating genetic material in order to analyze variants. In some embodiments, a variant is incorporated into a plasmid construct for analysis. In some embodiments, variants are introduced into at least one allele of the DNA of a biological cell. Several methods are well known to introduce variant mutations within an allele, including (but not limited to) CRISPR mutagenesis, Zinc-finger mutagenesis, and TALEN mutagenesis. In some embodiments, a common variant is changed into a rare variant. In some embodiments, a rare variant is changed into a common variant, especially when determining the effect of “correcting” a potentially pathogenic variant.
Various embodiments are directed towards development of cell lines having a particular set of variants. In some embodiments, a cell line can be manipulated by genetic engineering to harbor a set of variants. In some embodiments, a cell line can be derived from an individual (e.g., from a biopsy) which would harbor the variants identified in that individual. In some embodiments, a cell line from an individual can be genetically manipulated to “correct” a set of pathogenic variants. In some embodiments, a cell line having a set of pathogenic variants and a cell line having a set of control or “corrected” variants may be assessed to determine the cumulative effect of the set of variants, especially when modeling a medical disorder that is associated with the set of variants.
Various embodiments are directed to development of treatments related to diagnoses of individuals based on their regulatory variant data. As described herein, an individual may be diagnosed as having a particular trait status in relation to a disease. In some embodiments, an individual is diagnosed as having a disorder or having a high propensity for a disorder. Based on the pathogenicity of one's regulatory variant data, an individual can be treated with various medications and therapeutic regimens.
A number of embodiments are directed towards diagnosing individuals using pathogenicity scores of regulatory variant data. In some embodiments, a trained pathogenicity model has been trained using genetic data of pathogenic variants. In some embodiments, genomic loci are known to harbor variants that alter transcriptional and/or posttranscriptional regulation associated with a medical disorder. And in some embodiments, genomic loci known to harbor pathogenic variants are determined using a computational model utilizing genetic data of individuals known to have the medical disorder.
In a number of embodiments, diagnostics can be performed as follows:
Many embodiments of diagnostics improve on traditional diagnostic methods, especially in cases of complex disorders. Because the genetic contribution to complex disorders is often obscured by the fact that regulatory variants combine to yield the disorder, traditional genetic tests of examining a single gene, variant, and/or locus have been unavailable. As described herein, however, in some embodiments, a diagnosis is performed for a complex disease utilizing variant pathogenicity data aggregating techniques, such as those described in
Embodiments are directed towards genomic loci sequencing and/or single nucleotide polymorphism (SNP) array kits to be utilized within various methods as described herein. As described, various methods can diagnose an individual for a complex trait by examining variants in various regulatory genomic loci. Accordingly, a number of embodiments are directed towards genomic loci sequencing and SNP array kits that cover a set of genomic loci to diagnose a particular trait. In some instances, the set of genomic loci are identified by a computational model, such as one described in
A number of targeted gene sequencing protocols are known in the art, including (but not limited to) partial genome sequencing, primer-directed sequencing, and capture sequencing. Generally, targeted sequencing involves a selection step, by hybridization and/or amplification of the target sequences, prior to sequencing. Therefore, embodiments are directed to sequencing kits that target genomic loci that are known to harbor pathogenic variants to diagnose a particular medical disorder.
Likewise, a number of SNP array protocols are known in the art. In general, chip arrays are set with oligo sequences having a particular SNP. Sample DNA derived from an individual can be processed and then applied to a SNP array to determine sites of hybridization, indicating existence of a particular SNP. Thus, embodiments are directed to SNP array kits that target particular SNPs that are known to be pathogenic in order to diagnose a particular medical disorder.
The number of genomic loci and/or SNPs to include in a sequencing kit can vary, depending on the genomic loci and/or SNPs to examine for a particular trait and the computational model to be used. In some embodiments, the genomic loci and/or SNPs to be examined are identified by a computational model, such as the computational model described in
Within the examples described below, a number of genomic loci and variants have been identified that are likely pathogenic in ASD. In particular, Table 3 and Electronic Data Table 3 provide a number of variants with high pathogenicity. Table 4 and Electronic Data Table 4 provide a number of gene loci regions that experience a significant burden of pathogenic variants in ASD probands. Accordingly, these identified variants and/or loci can be utilized to develop capture sequencing and/or SNP array kits. In some embodiments, capture sequencing and/or SNP array kits are developed covering regions that have high variant pathogenicity, as identified in Electronic Data Tables 3 and 4. In some of these embodiments, the variants and/or genomic loci are selected based on their statistical score of relevance and/or pathogenicity score.
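As one illustrative sketch of selecting variants and/or genomic loci for a kit based on statistical score and/or pathogenicity score, candidate loci can be filtered on both measures and ranked. The record fields, cutoffs, locus names, and values below are hypothetical and are not taken from the tables referenced above.

```python
from typing import Iterable, List, NamedTuple, Optional


class Locus(NamedTuple):
    name: str
    pathogenicity: float  # higher means more likely pathogenic
    p_value: float        # statistical relevance; lower is more significant


def select_kit_loci(
    loci: Iterable[Locus],
    min_pathogenicity: float,
    max_p_value: float,
    max_targets: Optional[int] = None,
) -> List[Locus]:
    """Filter loci on both scores, then rank by pathogenicity (descending)."""
    passing = [
        locus for locus in loci
        if locus.pathogenicity >= min_pathogenicity and locus.p_value <= max_p_value
    ]
    # Ties on pathogenicity are broken by p-value, then by name, so the
    # ordering is deterministic.
    passing.sort(key=lambda l: (-l.pathogenicity, l.p_value, l.name))
    return passing if max_targets is None else passing[:max_targets]


# Hypothetical candidates.
candidate_loci = [
    Locus("locus_1", 0.92, 0.001),
    Locus("locus_2", 0.40, 0.002),  # fails the pathogenicity cutoff
    Locus("locus_3", 0.85, 0.200),  # fails the significance cutoff
    Locus("locus_4", 0.97, 0.010),
]
kit = select_kit_loci(candidate_loci, min_pathogenicity=0.8, max_p_value=0.05)
print([locus.name for locus in kit])
```

The optional `max_targets` argument reflects that the number of loci and/or SNPs included in a kit can vary with the trait and the model used.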
Several embodiments are directed to the use of medications and/or dietary supplements to treat an individual based on their medical disorder diagnosis. In some embodiments, medications and/or dietary supplements are administered in a therapeutically effective amount as part of a course of treatment. As used in this context, to “treat” means to ameliorate at least one symptom of the disorder to be treated or to provide a beneficial physiological effect.
A therapeutically effective amount can be an amount sufficient to prevent, reduce, ameliorate or eliminate symptoms of disorders or pathological conditions susceptible to such treatment, such as, for example, autism, bipolar disorder, depression, schizophrenia, or other diseases that are complex. In some embodiments, a therapeutically effective amount is an amount sufficient to reduce the symptoms of a complex disorder.
Dosage, toxicity and therapeutic efficacy of the compounds can be determined, e.g., by standard pharmaceutical procedures in cell cultures or experimental animals, e.g., for determining the LD50 (the dose lethal to 50% of the population) and the ED50 (the dose therapeutically effective in 50% of the population). The dose ratio between toxic and therapeutic effects is the therapeutic index and it can be expressed as the ratio LD50/ED50. Compounds that exhibit high therapeutic indices are preferred. While compounds that exhibit toxic side effects may be used, care should be taken to design a delivery system that targets such compounds to the site of affected tissue in order to minimize potential damage to other tissue and organs and, thereby, reduce side effects.
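The therapeutic index ratio described above can be written out as a short worked example. The function name and the dose values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def therapeutic_index(ld50: float, ed50: float) -> float:
    """Return the therapeutic index, expressed as the ratio LD50 / ED50.

    Both doses must be positive and given in the same units.
    """
    if ld50 <= 0 or ed50 <= 0:
        raise ValueError("LD50 and ED50 must be positive")
    return ld50 / ed50


# Hypothetical compound: lethal to 50% of the population at 500 mg/kg and
# therapeutically effective in 50% of the population at 25 mg/kg.
print(therapeutic_index(ld50=500.0, ed50=25.0))  # 20.0
```

A larger ratio indicates a wider margin between the effective dose and the lethal dose, which is why compounds with high therapeutic indices are preferred.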
Data obtained from cell culture assays or animal studies can be used in formulating a range of dosage for use in humans. If the pharmaceutical is provided systemically, the dosage of such compounds lies preferably within a range of circulating concentrations that include the ED50 with little or no toxicity. The dosage may vary within this range depending upon the dosage form employed and the route of administration utilized. For any compound used in the method of the invention, the therapeutically effective dose can be estimated initially from cell culture assays. A dose may be formulated in animal models to achieve a circulating plasma concentration or within the local environment to be treated in a range that includes the IC50 (i.e., the concentration of the test compound that achieves a half-maximal inhibition of neoplastic growth) as determined in cell culture. Such information can be used to more accurately determine useful doses in humans. Levels in plasma may be measured, for example, by liquid chromatography coupled to mass spectrometry.
An “effective amount” is an amount sufficient to effect beneficial or desired results. For example, a therapeutic amount is one that achieves the desired therapeutic effect. This amount can be the same or different from a prophylactically effective amount, which is an amount necessary to prevent onset of disease or disease symptoms. An effective amount can be administered in one or more administrations, applications or dosages. A therapeutically effective amount of a composition depends on the composition selected. The compositions can be administered from one or more times per day to one or more times per week, including once every other day. The skilled artisan will appreciate that certain factors may influence the dosage and timing required to effectively treat a subject, including but not limited to the severity of the disease or disorder, previous treatments, the general health and/or age of the subject, and other diseases present. Moreover, treatment of a subject with a therapeutically effective amount of the compositions described herein can include a single treatment or a series of treatments. For example, a single dose may be administered, several divided doses may be administered daily, or the compounds may be administered cyclically to achieve the desired therapeutic result.
A number of medications and treatments are known for several complex disorders, especially those that arise (at least in part) due to regulatory variants. Accordingly, embodiments are directed toward treating an individual with a treatment regime and/or medication when diagnosed with a complex disorder as described herein. Various embodiments are directed to treatments of complex (i.e., multifactorial) disorders, including (but not limited to) autism spectrum disorder, Alzheimer disease, arthritis, asthma, bipolar disorder, cancer, cleft lip and/or palate, coronary artery disease, Crohn's disease, dementia, depression, diabetes (type II), heart disease, heart failure, high cholesterol, hypertension, hypothyroidism, irritable bowel syndrome, obesity, osteoporosis, Parkinson disease, rhinitis (allergic and nonallergic), psoriasis, multiple sclerosis, schizophrenia, sleep apnea, spina bifida, and stroke.
Once diagnosed for having a risk of autism spectrum disorder, medical monitoring (e.g., regular check-ups) can be performed to look for signs of developmental delays. Various treatments include behavioral, communication, and educational therapies, each of which strives to improve a diagnosed individual's social and cognitive skills. Behavioral training, including applied behavior analysis, can be performed, in which ASD subjects are taught behavioral skills across different settings and desirable characteristics, such as appropriate social interactions, are reinforced. In some instances, speech and language pathology can be performed to improve development of language and communication skills, including the ability to articulate words well, comprehend verbal and nonverbal cues in a range of settings, initiate conversation, and develop conversational skills (e.g., appropriate time to say “good morning” or responses to questions asked). In some instances, an ASD subject is entered into special education courses. In some instances, risperidone can be administered, which treats irritability often associated with ASD individuals.
Once diagnosed for having a risk of Alzheimer's disease, neurological and neuropsychological tests can be performed to check mental status. Imaging (e.g., MRI, CT, and PET) can be performed to check for abnormalities in structure or function. A number of supplements may help brain health and may be prophylactic, including (but not limited to) omega-3 fatty acids, curcumin, ginkgo, and vitamin E. Exercise, diet, and social support can help promote good cognitive health. Medications for Alzheimer's include (but are not limited to) cholinesterase inhibitors and memantine.
Once diagnosed for having a risk of arthritis, laboratory tests on various bodily fluids can be performed to determine the type of arthritis. Imaging (e.g., X-rays, CT, MRI, and ultrasound) can be utilized to detect problems in various joints. Physical therapy may help relieve some complications associated with arthritis. Medications for arthritis include (but are not limited to) analgesics, nonsteroidal anti-inflammatory drugs (NSAIDs), counterirritants, disease-modifying antirheumatic drugs, biologic response modifiers, and corticosteroids. Heat pads, ice packs, acupuncture, glucosamine, yoga, and massage are examples of various home/alternative remedies available.
Once diagnosed for having a risk of asthma, tests can be performed to determine lung function. A chest X-ray or CT scan can be performed to determine any structural abnormalities. Medications for asthma include (but are not limited to) inhaled corticosteroids, leukotriene modifiers, long-acting beta agonists, short-acting beta agonists, theophylline, and ipratropium. In some instances, allergy medications may help asthma and thus allergy shots and/or omalizumab can be administered. Regular exercise and maintaining a healthy weight may help reduce asthma symptoms.
Once diagnosed for having a risk of bipolar disorder, a psychiatric assessment can be performed to determine the feelings and behavior patterns. Psychotherapies and medications are available to treat bipolar disorder. Psychotherapies include (but are not limited to) interpersonal and social rhythm therapy (IPSRT), cognitive behavioral therapy (CBT), and psychoeducation. Medications include (but are not limited to) mood stabilizers, antipsychotics, antidepressants, and anti-anxiety medications. Some lifestyle changes can help manage some cycles of behavior that may worsen the condition, including (but not limited to) limiting drugs and alcohol, forming healthy relationships with positive influence, and getting regular physical activity.
Once diagnosed for having a risk of cancer, physical exams, laboratory tests and imaging (e.g., CT, MRI, PET) can be performed to determine if cancerous tissue is present. A biopsy can be extracted to confirm a growth is cancerous. Various treatments can be performed, including (but not limited to) adjuvant treatment, palliative treatment, surgery, chemotherapy, radiation therapy, immunotherapy, hormone therapy, and targeted drug therapy. Exercise and a healthy diet can help an individual mitigate cancer onset and progression.
Once diagnosed for having a risk of cleft lip or palate, ultrasound can be performed in utero to determine whether a fetus is developing a cleft lip or palate. Typical treatment is surgery to repair the cleft tissue.
Once diagnosed for having a risk of coronary artery disease, an electrocardiogram and/or echocardiogram can be performed to determine a heart's performance. A stress test can be performed to determine the ability of the heart to respond to physical activity. A heart scan can determine whether calcium deposits are present. Patients having risk of coronary artery disease would benefit greatly from a few lifestyle changes, including (but not limited to) reducing tobacco use, eating healthy foods, exercising regularly, losing excess weight, and reducing stress. Various medications can also be administered, including (but not limited to) cholesterol-modifying medications, aspirin, beta blockers, calcium channel blockers, ranolazine, nitroglycerin, ACE inhibitors and angiotensin II receptor blockers. Angioplasty and coronary artery bypass can be performed when more aggressive treatment is necessary.
Once diagnosed for having a risk of Crohn's disease, a combination of tests and procedures can be performed to confirm the diagnosis, including (but not limited to) blood tests and various visual procedures such as a colonoscopy, CT scan, MRI, capsule endoscopy and balloon-assisted enteroscopy. Treatments for Crohn's disease include corticosteroids, oral 5-aminosalicylates, azathioprine, mercaptopurine, infliximab, adalimumab, certolizumab pegol, methotrexate, natalizumab and vedolizumab. A special diet may help suppress some inflammation of the bowel.
Once diagnosed for having a risk of dementia, further analysis of mental function can be performed to gauge memory, language skills, ability to focus, ability to reason, and visual perception. These analyses can be performed utilizing cognitive and neuropsychological tests. Brain scans (e.g., CT, MRI, and PET) and laboratory tests can be performed to determine if physiological complications exist. Medications for dementia include cholinesterase inhibitors and memantine.
Once diagnosed for having a risk of diabetes, a number of tests can be performed to determine an individual's glucose levels and regulation, including (but not limited to) glycated hemoglobin A1C test, fasting blood sugar levels, and oral glucose tolerance test. Routine visits may be performed to get a long-term view of glucose regulation. In addition, a glucose monitor can be utilized to continuously monitor glucose levels. Diabetes can be managed by various options, including (but not limited to) healthy eating, regular exercise, medication, and insulin therapy. Medications for diabetes include (but are not limited to) metformin, sulfonylureas, meglitinides, thiazolidinediones, DPP-4 inhibitors, SGLT inhibitors, and insulin.
Once diagnosed for having a risk of heart disease, various tests can be performed to determine heart function, including (but not limited to) electrocardiogram, Holter monitoring, echocardiogram, stress test, and cardiac catheterization. Lifestyle changes can dramatically improve heart disease, including (but not limited to) limiting tobacco products, controlling blood pressure, keeping cholesterol in check, keeping blood glucose levels in a good range, physical activities, eating healthy, maintaining a healthy weight, managing stress, and coping with depression. A number of medications can be provided, depending on the type of heart disease.
Once diagnosed for having a risk of heart failure, various tests can be performed to confirm the diagnosis, including (but not limited to) physical exams, blood tests, chest X-rays, electrocardiogram, stress test, imaging (e.g., CT and MRI), coronary angiogram, and myocardial biopsy. Medications for heart failure include (but are not limited to) ACE inhibitors, angiotensin II receptor blockers, beta blockers, diuretics, aldosterone antagonists, inotropes, and digoxin. Surgical procedures may be necessary, and include (but are not limited to) coronary bypass surgery and heart valve repair/replacement.
Once diagnosed for having a risk of high cholesterol, blood tests can be performed to measure total cholesterol, LDL cholesterol, HDL cholesterol, and triglycerides. Medications to manage cholesterol levels include (but are not limited to) statins, bile-acid-binding resins, cholesterol absorption inhibitors, and fibrates. Supplements can also be taken, including (but not limited to) co-enzyme Q, red yeast rice extract, niacin, soluble fiber, and omega-3 fatty acids. Individuals at risk for high cholesterol should also reduce tobacco products, eat a healthy diet (avoiding saturated fat, trans fat, and salt), and get regular exercise.
Once diagnosed for having a risk of hypertension, blood pressure levels can be monitored periodically (even at home). Elevated blood pressure and hypertension benefit from lifestyle changes including eating healthy, reducing sodium intake, regular physical activity, maintaining a proper weight, and limiting alcohol intake. Medications for hypertension include (but are not limited to) ACE inhibitors, angiotensin II receptor blockers, calcium channel blockers, alpha blockers, beta blockers, aldosterone antagonists, renin inhibitors, vasodilators, and central-acting agents.
Once diagnosed for having a risk of hypothyroidism, blood tests can be performed to measure the level of TSH and thyroid hormone thyroxine. Medications for hypothyroidism include (but are not limited to) the synthetic thyroid hormone levothyroxine; supplements such as iron, aluminum hydroxide, and calcium can interfere with its absorption and are therefore typically taken at a different time of day.
Once diagnosed for having a risk of irritable bowel syndrome (IBS), physical exams can be performed to confirm IBS including determining type of IBS. These exams include (but are not limited to) flexible sigmoidoscopy, colonoscopy, X-ray, and CT scan. A proper diet can be utilized to manage symptoms, including (but not limited to) high-fiber foods, plenty of fluids, and avoiding the following: high-gas foods, gluten, and FODMAPs. Medications for IBS include (but are not limited to) alosetron, eluxadoline, rifaximin, lubiprostone, linaclotide, fiber supplements, laxatives, anti-diarrheal medications, anticholinergic medications, antidepressants, and pain medications.
Once diagnosed for having a risk of obesity, a physiological test to determine body-mass index (BMI) may be performed. Obesity can be managed by various lifestyle remedies including (but not limited to) healthy diet, physical activity, and limiting tobacco products. If obesity is severe, various surgeries can be performed, including (but not limited to) gastric bypass surgery, laparoscopic adjustable gastric banding, biliopancreatic diversion with duodenal switch, and gastric sleeve.
Once diagnosed for having a risk of osteoporosis, bone density can be measured and routinely monitored using X-rays and other devices, as known in the art. Medications for osteoporosis include (but are not limited to) bisphosphonates, estrogen (and estrogen mimics), denosumab, and teriparatide. To reduce the risk of osteoporosis development, individuals can make various lifestyle changes, including (but not limited to) limiting tobacco use, limiting alcohol intake, and taking measures to prevent falls.
Once diagnosed for having a risk of Parkinson's disease, a single-photon emission computerized tomography (SPECT) scan can image dopamine transporter activity in the brain, which can be monitored over time. Medications for Parkinson's includes (but are not limited to) carbidopa-levodopa, dopamine agonists, MAO B inhibitors, COMT inhibitors, anticholinergics and amantadine.
Once diagnosed for having a risk of rhinitis, various tests can be performed to determine if the rhinitis is due to allergies, including (but not limited to) skin tests looking for allergic reaction, blood tests to measure responses to allergies (e.g., IgE levels). Medications for rhinitis include (but are not limited to) saline nasal sprays, corticosteroid nasal sprays, antihistamines, anticholinergic nasal sprays, and decongestants.
Once diagnosed for having a risk of psoriasis, routine physical exams of the skin, scalp and nails can be performed to look for signs of inflammation. A number of topical treatments can be performed for psoriasis, including (but not limited to) topical corticosteroid, vitamin D analogues, anthralin, topical retinoids, calcineurin inhibitors, salicylic acid, coal tar, and moisturizers. A number of phototherapies can also be performed, including (but not limited to) exposure to sunlight, UVB phototherapy, Goeckerman therapy, excimer laser, and psoralen plus ultraviolet A therapy. Medications for psoriasis include (but are not limited to) retinoids, methotrexate, cyclosporine, and biologics that reduce immune-mediated inflammation (e.g., entanercept, infliximab, adalimumab).
Once diagnosed for having a risk of multiple sclerosis (MS), various tests can be performed overtime to monitor symptoms of MS, including (but not limited to) blood tests, lumbar puncture, MRI and evoked potential tests. A number treatments can help treat acute MS symptoms and to mitigate MS progression, including (but not limited to) corticosteroids, plasma exchange, ocrelixumab, beta interferons, glatiramer acetate, dimethyl fumarate, fingolimod, teriflunomide, natalizumab, alemtuzumab, and mitoxantrone. Physical therapy and muscle relaxants also help mitigate (or prevent) MS symptoms.
Once diagnosed as having a risk of schizophrenia, a physical exam and/or psychiatric evaluation may be performed to determine if symptoms of schizophrenia are apparent. Various antipsychotics may be administered, including (but not limited to) aripiprazole, asenapine, brexpiprazole, cariprazine, clozapine, iloperidone, lurasidone, olanzapine, paliperidone, quetiapine, risperidone, and ziprasidone. Individuals with a risk of schizophrenia may also benefit from various psychosocial interventions, including normalizing thought patterns, improving communication skills, and improving the ability to participate in daily activities.
Once diagnosed as having a risk of sleep apnea, an evaluation that monitors an individual's sleep may be performed, including (but not limited to) nocturnal polysomnography and measurements of heart rate, blood oxygen levels, airflow, and breathing patterns. Sleep apnea therapy may include the use of a continuous positive airway pressure (CPAP) device. A number of lifestyle changes have also been shown to mitigate complications associated with sleep apnea, including (but not limited to) losing excess weight, physical activity, reducing alcohol consumption, and sleeping on the side or abdomen.
Once diagnosed as having a risk of spina bifida, prenatal screening tests can be performed and routinely monitored to determine if a fetus is developing spina bifida. Blood tests that can be performed include (but are not limited to) a maternal serum alpha-fetoprotein (AFP) test and measurement of AFP levels. Routine ultrasound can be performed to screen for spina bifida. Various treatments include (but are not limited to) prenatal surgery to repair the baby's spinal cord and post-birth surgery to put the meninges back in place and close the opening of the vertebrae.
Once diagnosed as having a risk of stroke, routine monitoring can be performed to determine coronary health status, including (but not limited to) blood clotting tests, imaging (e.g., CT and MRI) to look for potential clots, carotid ultrasound, cerebral angiogram, and echocardiogram. Various procedures that can be performed include (but are not limited to) carotid endarterectomy and angioplasty. Patients having a risk of stroke would benefit greatly from a few lifestyle changes, including (but not limited to) reducing tobacco use, eating healthy foods, exercising regularly, losing excess weight, and reducing stress. Various medications can also be administered, including (but not limited to) cholesterol-modifying medications, aspirin, beta blockers, calcium channel blockers, ranolazine, nitroglycerin, ACE inhibitors, and angiotensin II receptor blockers.
A number of embodiments are directed towards altering treatments of individuals based on their biochemical regulation of genes involved with drug metabolism. In some embodiments, a model is trained to identify loci harboring variants that affect regulation of drug metabolizing genes. In some embodiments, genomic loci known to harbor variants that alter transcriptional and/or posttranscriptional regulation are associated with drug metabolism. In some embodiments, the pathogenicity of the detected variants is determined, which may be used to determine the biochemical activity of a drug metabolizing gene. And in some embodiments, the biochemical activity and/or pathogenicity of variants affecting a drug metabolizing gene are determined using a computational model. Based on the results, in some embodiments, dosing can be altered (i.e., high metabolizers are dosed higher and low metabolizers are dosed lower).
Several medications are known to be metabolized differently by individuals based on the expression of a few key genes. Table 5 is a list of medications and the genes that are involved with metabolism of each medication. Medications and genes involved in their metabolism can also be found using the PharmGKB database (www.pharmgkb.org). Accordingly, based on methods described herein that determine alterations in biochemical regulation, especially in transcriptional and/or posttranscriptional regulation, an individual can be treated appropriately. For example, the gene CYP2D6 is involved in the metabolism of oxycodone. If an individual is found to have regulatory variants that decrease the activity of CYP2D6, then lower doses of oxycodone (or an alternative medication) can be administered. If an individual is found to have regulatory variants that increase the activity of CYP2D6, then higher doses of oxycodone (or an alternative medication) can be administered. In some embodiments, determination of transcriptional and/or posttranscriptional regulatory effects of variants and/or their pathogenicity by performing methods described in
In many embodiments, dosing alteration methods are performed as follows:
Bioinformatic and biological data support the methods and systems of determining the contribution of variants to transcriptional and posttranscriptional regulation and further determining a pathogenicity score using the regulatory variants, and applications thereof. In the ensuing sections, exemplary computational methods and exemplary applications related to variant classifications are provided, especially in the context of autism spectrum disorder (ASD). Exemplary methods and applications can also be found in the publication “Whole-genome deep learning analysis reveals causal role of noncoding mutations in autism” of J. Zhou, et al., bioRxiv 319681 (May 11, 2018), the disclosure of which is herein incorporated by reference.
Within the following examples, a deep-learning based approach for quantitatively assessing the impact of noncoding mutations on human disease is provided. The approach addresses the statistical challenge of detecting the contribution of noncoding mutations by predicting their specific effects on transcriptional and post-transcriptional levels. This approach is general and can be applied to study contributions of mutations to any complex disease or phenotype.
In this example, the strategy was applied to ASD using the 1,790 whole genome sequenced families from the Simons Simplex Collection, and for the first time the results demonstrate a significant proband-specific signal in regulatory de novo noncoding sequence. Importantly, this signal was not only independently detected at the transcriptional level, but the proband-specific posttranscriptional burden was also found to be significant. Previously, there has been limited evidence for disease contribution of mutations disrupting posttranscriptional mechanisms outside of the canonical splice sites. Here, a significant ASD disease association is demonstrated at the de novo mutation level for variants impacting a large collection of RBPs involved in posttranscriptional regulation. Overall, the results suggest that both transcriptional and posttranscriptional mechanisms play a significant role in complex disorders such as ASD.
The analyses also demonstrate the ability to diagnose complex traits from genetic information, including de novo noncoding mutations that affect transcriptional and posttranscriptional regulation.
Analysis of the noncoding mutation contribution to ASD is challenging due to the difficulty of assessing which noncoding mutations are functional, and further, which of those contribute to the disease phenotype. For predicting the regulatory impact of noncoding mutations, a deep convolutional network-based framework was constructed to directly model the functional impact of each mutation and provide a biochemical interpretation including the disruption of transcription factor binding and chromatin mark establishment at the DNA level and of RBP binding at the RNA level (
To illustrate the capabilities of the transcriptional and posttranscriptional models and the pathogenicity computational model, an analysis of the noncoding mutation contribution to ASD was performed using whole genome sequencing (WGS) data derived from the Simons Simplex Collection (SSC), available via the Simons Foundation Autism Research Initiative (SFARI). The data were processed to generate variant calls via the standard GATK pipeline (https://software.broadinstitute.org/gatk/). To call de novo single nucleotide substitutions, inherited mutations were removed, and candidate de novo mutations were selected from the GATK variant calls where the alleles were not present in parents and the parents were homozygous with the same allele. The DNMFilter classifier was then used to score each candidate de novo mutation, and a threshold of probability>0.75 was applied for SSC phase1-2 and a cutoff of probability>0.5 for phase3 to obtain a comparable number of high-confidence DNM calls across phases (for more on DNMFilter, see Gene Ontology Consortium, Nucleic Acid Res. 43, D1049-56 (2015), the disclosure of which is herein incorporated by reference).
The DNMFilter classifier was trained with an expanded training set combining the original training standards with the verified DNMs from the SSC pilot WGS studies for the initial 40 SSC families. For final analysis, de novo mutation calls within the low complexity repeat regions from UCSC browser table RepeatMasker were removed (see H. Mi, et al., Nucleic Acids Res. 45, D183-D189 (2017), the disclosure of which is herein incorporated by reference). Also, de novo mutations appearing in multiple SSC families (i.e., non-singleton de novo mutations) and individuals with outlier numbers of mutations (more than 3 standard deviations above the average) were excluded from the analysis.
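By way of non-limiting illustration, the two filtering steps described above (the phase-specific probability cutoff and the exclusion of individuals with outlier mutation counts) can be sketched as follows. The record layout (keys 'individual', 'phase', 'prob'), the function names, and the use of the sample standard deviation are assumptions made for this sketch only; the text does not specify them, and the classifier scores are taken as given rather than computed.

```python
from statistics import mean, stdev

# Probability cutoffs per SSC phase, as stated in the text.
DNM_PROB_CUTOFF = {"phase1-2": 0.75, "phase3": 0.5}


def keep_high_confidence(calls):
    """Keep candidate calls whose classifier probability exceeds the phase cutoff.

    `calls` is a list of dicts with hypothetical keys 'individual', 'phase', 'prob'.
    """
    return [c for c in calls if c["prob"] > DNM_PROB_CUTOFF[c["phase"]]]


def drop_outlier_individuals(calls, n_sd=3.0):
    """Drop all calls from individuals whose call count exceeds mean + n_sd * SD.

    Assumption: the SD is the sample standard deviation over individuals that
    have at least one call; the text does not state which estimator was used.
    """
    counts = {}
    for c in calls:
        counts[c["individual"]] = counts.get(c["individual"], 0) + 1
    values = list(counts.values())
    if len(values) < 2:
        return list(calls)  # not enough individuals to estimate a spread
    limit = mean(values) + n_sd * stdev(values)
    return [c for c in calls if counts[c["individual"]] <= limit]
```

In such a sketch the two functions would typically be applied in sequence, with repeat-region and non-singleton filtering handled separately.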
Overall genome-wide, 77.7 mutations per individual were detected with Ti/Tv ratio 2.01 [2.00, 2.03] (78.7 for probands with Ti/Tv=2.02 [1.99, 2.04], 76.7 for siblings with Ti/Tv=2.01 [1.99, 2.03]), with no significant difference in mutation substitution patterns between proband and sibling (
For training the transcriptional regulatory effects model, training labels, such as histone marks, transcription factors, and DNase I profiles, were processed from uniformly processed ENCODE and Roadmap Epigenomics data releases. The training procedure is similar to that previously described (J. Zhou & O. G. Troyanskaya (2015), cited supra) with several modifications. The model architecture was extended to double the number of convolution layers for increased model depth (see below for details). Input features were expanded to include all of the released Roadmap Epigenomics histone marks and DNase I profiles, resulting in 2,002 total features (subset provided in Table 1; full list is provided in electronic format via Electronic Data Table 1).
The model architecture for transcriptional regulatory effects model:
Input (Size: 4 bases×1000 bp)=>
(#1): Convolution(4→320, kernel size=8)
(#2): ReLU
(#3): Convolution(320→320, kernel size=8)
(#4): ReLU
(#5): Dropout(Probability=0.2)
(#6): Max pooling(pooling size=4)
(#7): Convolution(320→480, kernel size=8)
(#8): ReLU
(#9): Convolution(480→480, kernel size=8)
(#10): ReLU
(#11): Dropout(Probability=0.2)
(#12): Max pooling(pooling size=4)
(#13): Convolution(480→960, kernel size=8)
(#14): ReLU
(#15): Convolution(960→960, kernel size=8)
(#16): ReLU
(#17): Dropout(Probability=0.2)
(#18): Linear(42240→2003)
(#19): ReLU
(#20): Linear(2003→2002)
(#21): Sigmoid
=>Output (Size: 2002 transcriptional regulatory features)
ReLU indicates the rectified linear unit activation function. Sigmoid indicates the Sigmoid activation function. Notations such as ‘4→320’ indicate the input and output channel size for each layer. When not indicated, the output channel size is equal to the input channel size.
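As a consistency check on the layer listing above, the input size of the first linear layer (42240) can be derived from the 1000 bp input. The following minimal sketch assumes stride-1 convolutions without padding and non-overlapping max pooling with the output length rounded down; these settings are not stated in the listing and are assumptions, as are the function names.

```python
def conv_out(length, kernel):
    """Output length of a stride-1, unpadded 1D convolution (assumed settings)."""
    return length - kernel + 1


def pool_out(length, size):
    """Output length of non-overlapping max pooling, rounded down (assumed)."""
    return length // size


def flattened_size(seq_len=1000, kernel=8, pool=4, final_channels=960):
    """Number of values entering the first linear layer of the listed network."""
    n = seq_len
    n = conv_out(conv_out(n, kernel), kernel)  # layers #1 and #3
    n = pool_out(n, pool)                      # layer #6
    n = conv_out(conv_out(n, kernel), kernel)  # layers #7 and #9
    n = pool_out(n, pool)                      # layer #12
    n = conv_out(conv_out(n, kernel), kernel)  # layers #13 and #15
    return final_channels * n                  # channels x remaining positions


print(flattened_size())  # 42240
```

Under these assumptions the sequence length shrinks from 1000 to 44 positions, and 960 channels × 44 positions gives the 42240 inputs of layer #18.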
For training the posttranscriptional regulatory effects model, the Seqweaver network architecture and training procedure with RNA-binding protein (RBP) profiles as training labels were utilized (see below for architecture and parameters). RNA features, composed of 231 CLIP binding profiles for 82 unique RBPs (ENCODE and previously published CLIP datasets), were uniformly processed. A branch-point mapping profile was also used as an input feature (subset provided in Table 2; full list is provided in electronic format via Electronic Data Table 2). CLIP data processing followed a previously detailed pipeline (J. M. Moore, et al., Nat. Protoc. 9, 263-293 (2014), the disclosure of which is herein incorporated by reference). All CLIP peaks with p-value<0.1 were used for training, with an additional filter requirement of two-fold enrichment over input for ENCODE eCLIP data. In contrast to DeepSEA, only transcribed genic regions were considered as training labels for the post-transcriptional regulatory effects model. Specifically, all gene regions defined by Ensembl (mouse build 80, human build 75) were split into 50 nt bins in the transcribed strand sequence. For each sequence bin, RBP profiles that overlapped more than half of the bin were assigned a positive label for the corresponding RBP model. Negative labels for a given RBP model were assigned to sequence bins where non-overlapping peaks of other RBPs were observed. Note that the deep learning models, both transcriptional and posttranscriptional, do not use any mutation data for training, and thus each can predict mutation impact regardless of whether the mutation has been previously observed.
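The bin-labeling rule described above can be illustrated with the following minimal sketch. It assumes half-open coordinates, treats a bin as positive when any single peak of the RBP covers more than half of the bin, and treats a bin as negative when it is not positive and a peak of another RBP covers more than half of it. This reading of the negative-label rule, the data layout, and the function names are assumptions for illustration; strand handling is omitted.

```python
BIN_SIZE = 50  # nt, as stated in the text


def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def covers_bin(peaks, start, end):
    """True if any peak overlaps more than half of the bin [start, end)."""
    return any(overlap(start, end, s, e) > (end - start) / 2 for s, e in peaks)


def label_bins(region_start, region_end, peaks_by_rbp, rbp):
    """Return {bin_start: label} for one RBP; bins with no label are omitted.

    `peaks_by_rbp` maps an RBP name to a list of (start, end) peak intervals
    (a hypothetical layout chosen for this sketch).
    """
    labels = {}
    for start in range(region_start, region_end - BIN_SIZE + 1, BIN_SIZE):
        end = start + BIN_SIZE
        if covers_bin(peaks_by_rbp.get(rbp, []), start, end):
            labels[start] = 1  # positive: this RBP's peak covers the bin
        elif any(covers_bin(p, start, end)
                 for name, p in peaks_by_rbp.items() if name != rbp):
            labels[start] = 0  # negative: only other RBPs' peaks cover the bin
    return labels
```

For example, with a peak of one RBP at (10, 45) and a peak of another RBP at (60, 100), the first 50 nt bin is positive for the first RBP and the second bin is negative for it.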
The model architecture and parameters for posttranscriptional regulatory effects model:
Dropout Proportion:
Overall design and results of the trained transcriptional (TRD) and posttranscriptional (RRD) models are provided in
To link the biochemical disruption caused by a variant with phenotypic impact, a regularized linear model was trained using a set of curated human disease regulatory noncoding mutations and rare variants from healthy individuals to generate a predicted disease impact score (DIS) (i.e., pathogenicity) for each autism mutation independently based on its predicted transcriptional and post-transcriptional regulatory effects. As mutation-positive examples, 4,401 regulatory noncoding mutations curated in the Human Gene Mutation Database (HGMD) with mutation type “regulatory” (DM, DM?, DFP, DP and FP) were used for training (for more on HGMD and mutation type see P. D. Stenson, et al., Hum. Genet. 132, 1-9 (2014), the disclosure of which is herein incorporated by reference). For negative examples of background mutations, 999,668 rare variants that were only observed once within the healthy individuals from the 1000 Genomes project were used (see 1000 Genomes Project Consortium et al., Nature, 526, 68-74 (2015), the disclosure of which is herein incorporated by reference). It was also shown that using common variants with AF>0.01 and within 100 kb of a mutation-positive hit as negative training labels yields similar results to the use of the 1000 Genomes project data. Absolute predicted probability differences computed by the convolutional network transcriptional regulatory effects model were used as input features for each of the 2,002 transcriptional regulatory features and for the 232 post-transcriptional regulatory features in the disease impact model. Input features were standardized to unit variance and zero mean before being used for training. An L2 regularized logistic regression model was separately trained for the transcriptional effect model (lambda=10) and the post-transcriptional effect model (lambda=10, using only genic region variant examples) with the xgboost package (https://github.com/dmlc/xgboost).
The predicted probabilities are z-transformed to have mean 0 and standard deviation 1 across all proband and sibling mutations.
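The feature construction and z-transformation described above can be illustrated with the following minimal sketch. It covers only the absolute probability difference between reference and alternative alleles and the final z-transform; the logistic regression itself is omitted. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def effect_features(p_ref, p_alt):
    """Absolute predicted probability difference per regulatory feature."""
    return [abs(a - r) for r, a in zip(p_ref, p_alt)]


def z_transform(values):
    """Rescale values to mean 0 and standard deviation 1.

    Assumption: the population standard deviation is used, so the transformed
    values have standard deviation exactly 1 in that sense.
    """
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]
```

In use, `effect_features` would be applied once per mutation across all regulatory features, and `z_transform` would be applied to the model's predicted probabilities pooled over all proband and sibling mutations.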
With these approaches, the functional impact of de novo mutations on regulatory factor binding and chromatin properties was systematically assessed using data derived from 7,097 whole genomes from the SSC cohort (total 127,139 non-repeat region SNVs; subset provided in Table 3; full list is provided in electronic format via Electronic Data Table 3). When considering all de novo mutations, a significantly higher functional impact in probands was observed compared to unaffected siblings, independently at the transcriptional (p=9.4×10⁻³, one-sided Wilcoxon rank-sum test for all; FDR=0.033, corrected for all mutation sets tested) and post-transcriptional (p=2.4×10⁻⁴, FDR=0.0049) levels (
To gain further insight into the ASD noncoding regulatory landscape, a comprehensive analysis was performed with full multiple hypothesis correction for all combinations of 14 gene-sets and 10 genomic regions tested (e.g., TSS or exon proximal) previously described in D. M. Werling et al. (Nat. Genet. 50, 727-736 (2018), the disclosure of which is herein incorporated by reference).
The 14 gene-sets include GENCODE protein coding genes, antisense genes, lincRNAs, pseudogenes, genes with loss-of-function intolerance (pLI) score>0.9 from ExAC, predicted ASD risk genes (FDR<0.3), FMRP target genes, genes associated with developmental delay, and CHD8 target genes. For genes with expression specific to each of the 53 GTEx tissues, the expression table from GTEx v7 (gene median TPM per tissue) was used to select genes for which expression in a given tissue was five times higher than the median expression across all tissues.
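The tissue-specific gene selection described above can be sketched as follows. The sketch reads "five times higher than the median" as a strict comparison against 5 × the across-tissue median; the nested-dictionary layout of the expression table and the function name are assumptions for illustration.

```python
from statistics import median


def tissue_specific_genes(tpm, tissue, fold=5.0):
    """Select genes whose expression in `tissue` exceeds fold x the median
    expression across all tissues.

    `tpm` maps gene -> {tissue: median TPM}, a hypothetical layout standing in
    for the GTEx v7 gene median TPM table.
    """
    selected = []
    for gene, by_tissue in tpm.items():
        across_tissue_median = median(by_tissue.values())
        if by_tissue[tissue] > fold * across_tissue_median:
            selected.append(gene)
    return selected
```

With toy values, a gene at 60 TPM in brain and an across-tissue median of 10 TPM would be selected for brain, whereas a gene at 12 TPM with a median of 11 TPM would not.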
The representative TSS for each gene was determined based on FANTOM CAGE transcription initiation counts relative to GENCODE gene models. Specifically, a CAGE peak is associated with a GENCODE gene if it is within 1000 bp of a GENCODE v24 annotated transcription start site. Peaks within 1000 bp of rRNA, snRNA, snoRNA or tRNA genes were removed to avoid confusion. Next, the most abundant CAGE peak for each gene was selected, and the TSS position reported for the CAGE peak was used as the selected representative TSS for the gene. For genes with no CAGE peaks assigned, the GENCODE annotated gene start position was used as the representative TSS. FANTOM CAGE peak abundance data were downloaded at http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/ and the CAGE read counts were aggregated over all FANTOM 5 tissue or cell types. GENCODE v24 annotations lifted to GRCh37 coordinates were downloaded from http://www.gencodegenes.org/releases/24lift37.html. All chromatin profiles used from the ENCODE and Roadmap Epigenomics projects are listed in Electronic Data Table 1. The HGMD mutations are from HGMD professional version 2018.1.
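The representative TSS selection described above can be illustrated with the following minimal sketch for a single gene. It treats "within 1000 bp" as an inclusive distance, ignores strand, and omits the removal of peaks near rRNA, snRNA, snoRNA or tRNA genes; the argument layout and function name are assumptions for illustration.

```python
def representative_tss(gene_start, annotated_tss, cage_peaks, max_dist=1000):
    """Pick a representative TSS for one gene.

    `annotated_tss` is a list of annotated TSS positions for the gene and
    `cage_peaks` is a list of (position, aggregated_read_count) tuples on the
    same chromosome (hypothetical inputs for this sketch).
    """
    nearby = [
        (pos, count)
        for pos, count in cage_peaks
        if any(abs(pos - tss) <= max_dist for tss in annotated_tss)
    ]
    if not nearby:
        # No CAGE peak assigned: fall back to the annotated gene start.
        return gene_start
    # Otherwise use the position of the most abundant associated peak.
    return max(nearby, key=lambda peak: peak[1])[0]
```

For instance, a highly abundant peak located 2000 bp from every annotated TSS is ignored, and the most abundant peak among those within 1000 bp is chosen instead.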
Human exons that are alternatively spliced (AS) were obtained from a recent study that has examined publicly available human RNA-seq data to annotate an extensive catalog of AS events (Q. Yan, et al., Proc. Natl. Acad. Sci. 111, 3445-3450 (2015), the disclosure of which is herein incorporated by reference). Internal exon regions (both 5′SS & 3′SS flanking introns), upstream exon (5′SS flanking introns), and downstream terminal exon (3′SS flanking introns) were used for alternative exon definition types of cassette, mutually exclusive, tandem cassette exons. Terminal exon region was used for intron retention, alternative 3′ or 5′ exon AS exon types. All selected exon-flanking intronic regions were collapsed into a final set of genomic intervals used to subset SNVs that are located within alternative splicing exon region (200 or 400 nts from exon boundary), illustrated in
When restricted to genomic regions of higher regulatory potential (i.e. near TSS or alternatively spliced exons), an increased dysregulation effect size was observed (
Although one of the hallmarks of autism is altered brain development, a comprehensive tissue association has not been established for de novo noncoding variants. To explore the proband-specific tissue signal, the variant effects for tissue-specific genes derived from all 53 GTEx tissues and cell types were systematically tested (for more on GTEx tissues and cell types, see F. Aguet, et al., Nature 550, 204-213 (2017), the disclosure of which is herein incorporated by reference). A consistent significant proband-specific mutation effect associated with brain tissues was observed, with brain regions constituting the top 11 ranked tissues (by difference in proband vs sibling noncoding mutation effect) (
The underlying processes and pathways impacted by de novo noncoding mutations in ASD were investigated. Such analysis is challenging because, in addition to the variability in functional impact of mutations, ASD probands appear highly heterogeneous in underlying causal genetic perturbations, and single mutations could cause a widespread effect on downstream genes. Thus, to detect genes and pathways relevant to the pathogenicity of ASD TRD and RRD mutations, a network-based statistical approach was developed, NDEA (Network-neighborhood Differential Enrichment Analysis) (
NDEA was used to test the differential (proband vs sibling) impact of mutations on each gene or gene set. Intuitively, this test generates a p-value that reflects the proband-specific impact of mutations on that gene or gene set, including through its network neighborhood. This also enables statistical assessment of which gene sets (e.g., pathways) are significantly more affected by proband mutations compared to sibling mutations. Technically, NDEA performs a weighted two-sample (proband vs sibling mutations) test, where the weight for each observation is defined based on network connectivity scores (to the gene or gene sets) and the two samples are compared based on weighted averages. Each weight is a non-negative constant that specifies the relative contribution of an observation to the test statistic. When all weights are the same, the test reduces to the regular two-sample t test; when the weights differ, it adjusts the standard t statistic to use the appropriate variance resulting from weighting. Note that, unlike some other weighted t-tests, the weights are not random variables and do not represent sample sizes. The assumptions of the NDEA test are analogous to those of the standard two-sample t test, including that samples in each set are i.i.d. and the weighted sample means are normally distributed.
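To make the idea of a weighted two-sample comparison with fixed, non-random weights concrete, the following is a generic sketch and is not asserted to be the NDEA statistic itself. It assumes an unequal-variance (Welch-type) form, so that with equal weights it reduces to Welch's two-sample t statistic; the variance estimator, the function names, and the inputs are all assumptions for illustration. Each group needs at least two observations with non-zero weight.

```python
from math import sqrt


def weighted_mean_var(x, w):
    """Weighted mean, variance estimate, and variance scale for fixed weights."""
    sw = sum(w)
    sw2 = sum(wi * wi for wi in w)
    m = sum(wi * xi for wi, xi in zip(w, x)) / sw
    # Unbiased variance estimate when weights are constants (not frequencies).
    var = sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / (sw - sw2 / sw)
    # Var(weighted mean) = var * scale; scale equals 1/n when weights are equal.
    scale = sw2 / (sw * sw)
    return m, var, scale


def weighted_t(x_p, w_p, x_s, w_s):
    """Generic weighted two-sample t statistic (proband minus sibling)."""
    m_p, v_p, k_p = weighted_mean_var(x_p, w_p)
    m_s, v_s, k_s = weighted_mean_var(x_s, w_s)
    return (m_p - m_s) / sqrt(v_p * k_p + v_s * k_s)
```

In this sketch, `x_p` and `x_s` would hold mutation disease impact scores for proband and sibling mutations, and `w_p` and `w_s` the network connectivity scores of each mutation's gene to the gene being tested. Multiplying all weights in a group by a constant leaves the statistic unchanged.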
For each gene i, the NDEA t statistic is computed by
in which μP
Under the null hypothesis that the two groups have no difference, the above t statistic approximately follows a t-distribution with the following degrees of freedom:
For testing the significance of the difference between proband and sibling mutations, mutations within 100 kb of the representative TSS of all genes and all intronic mutations within 400 bp of an exon boundary were included in this analysis. RNA model disease impact scores were used as the mutation score for intronic mutations within 400 bp of an exon boundary, and DNA model disease impact scores were used for other mutations.
For gene set level NDEA, the gene set was considered as a meta-node that contains all genes that are annotated to the gene set (e.g., a GO term). Then, for any given gene, the average of its network edge scores to all genes in the meta-node is used as the weight. GO term annotations were pooled from human (EBI May 9, 2017), mouse (MGI May 26, 2017) and rat (RGD Apr. 8, 2017). Query GO terms were obtained from the merged set of curated GO consortium slims from Generic, Synapse, ChEMBL, and supplemented by PANTHER GO-slim and terms from NIGO (see Gene Ontology Consortium, Nucleic Acids Res. 43, D1049-56 (2015); H. Mi, et al., Nucleic Acids Res. 45, D183-D189 (2017); and N. Geifman, A. Monsonego & E. Rubin BMC Bioinformatics 11, (2010), the disclosures of which are each herein incorporated by reference).
For network-based analysis of correlation between coding and noncoding TRD and RRD mutations, the NDEA t-statistic was first computed for every gene for all protein coding mutations from the SSC exome sequencing study, all SSC WGS noncoding mutations within 100 kb of a gene, and all SSC WGS genic noncoding mutations within 400 bp of an exon, respectively. Correlation across all resulting gene-specific t-statistics between all three pairs of mutation types was then computed. For testing statistical significance of the correlation, proband and sibling labels were permuted for all mutations to compute the null distributions of correlations for each pair of mutation types. 1000 permutations were performed.
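The label-permutation procedure described above can be illustrated with the following generic sketch. The `statistic` argument stands in for the full computation (recomputing gene-level t-statistics from the permuted labels and then their correlation) and is a hypothetical callable; the add-one correction in the p-value and the fixed random seed are assumptions not stated in the text.

```python
import random


def permuted_labels(labels, rng):
    """Return a shuffled copy of the proband/sibling labels."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def permutation_test(statistic, labels, n_perm=1000, seed=0):
    """Two-sided permutation p-value for `statistic(labels)`.

    `statistic` is a hypothetical callable mapping a list of labels to a
    number, e.g. a correlation of gene-level statistics.
    """
    rng = random.Random(seed)
    observed = statistic(labels)
    null = [statistic(permuted_labels(labels, rng)) for _ in range(n_perm)]
    extreme = sum(1 for v in null if abs(v) >= abs(observed))
    # Add-one correction keeps the estimate strictly positive (an assumption).
    return observed, (extreme + 1) / (n_perm + 1)
```

A toy `statistic` could be the difference in mean score between mutations labeled proband and those labeled sibling; in the analysis described above it would instead return the correlation between two sets of gene-specific t-statistics.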
For network visualization, a two-dimensional embedding with t-SNE was computed by directly taking a distance matrix of all pairs of genes as the input (see L. Van Der Maaten & G. Hinton, J. Mach. Learn. Res. 9, 2579-2605 (2008), the disclosure of which is herein incorporated by reference). The distance matrix was computed as −log(probability) from the edge probability score matrix in the brain-specific functional relationship network. The Barnes-Hut t-SNE algorithm implemented in the Rtsne package was used for the computation. Louvain community clustering was performed on the subnetwork containing all protein-coding genes with top 10% NDEA FDR.
When applied to ASD de novo mutations, the NDEA approach identifies genes whose functional network neighborhood is significantly enriched for genes with stronger predicted disease impact in proband mutations compared to sibling mutations (50 most significant genes provided in Table 4; full list is provided in electronic format via Electronic Data Table 4).
Globally, NDEA enrichment analysis pointed to a proband-specific role for noncoding mutations in affecting neuronal development, including in synaptic transmission and chromatin regulation (
Next, the genetic landscape of ASD-associated de novo noncoding and coding mutations was examined. Specifically, in addition to the network analysis of noncoding mutations at the transcriptional and post-transcriptional levels, the analysis was also applied to the de novo coding mutations. The gene-specific NDEA statistic of elevated proband-specific noncoding mutation burden was compared to that of the coding mutations, finding a significant positive correlation for both TRD and RRD (p=0.004 for TRD, p=0.042 for RRD; two-sided permutation test). Moreover, by network analysis, TRD and RRD are themselves significantly correlated (p=0.034, two-sided permutation test). This demonstrates that coding and noncoding mutations affect overlapping processes and pathways, indicating a convergent genetic landscape, and highlights the potential of ASD gene discovery by combining coding and noncoding mutations.
The gene network analysis identified new candidate noncoding disease mutations with potential impact on ASD through regulation of gene expression. In order to add further evidence to a set of high confidence causal mutations, allele-specific effects of predicted high-impact mutations were examined in cell-based assays (see Table 3 for variants tested). For TRD mutations, fifty-nine genomic regions showed strong transcriptional activity, with 96% of proband variants (57 variants) showing robust differential activity (
To perform the luciferase reporter assays, human neuroblastoma BE(2)-C cells were plated at 2×10⁴ cells/well in 96-well plates and 24 hours later were transfected with Lipofectamine 3000 (L3000-015, Thermofisher Scientific) together with 75 ng of Promega pGL4.23 firefly luciferase vector containing the 230 nt of human genomic DNA from the loci of interest, and 4 ng of pNL3.1 NanoLuc (shrimp luciferase) plasmid, for normalization of transfection conditions. 42 hours after transfection, luminescence was detected with the Promega NanoGlo Dual Luciferase assay system (N1630) and a BioTek Synergy plate reader. Four to six replicates per variant were tested in each experiment. For each sequence tested, the ratio of firefly luminescence (ASD allele) to NanoLuc luminescence (transfection control) was calculated and then normalized to empty vector (pGL4.23 with no insert). Statistics were calculated from fold over empty vector values from each biological replicate. High-confidence differentially-expressing alleles were defined by their ability to show the same effect in each biological replicate (n=3, minimum), to drive gene expression higher than the control empty-vector level, and to show significantly different levels of luciferase activity between the two alleles by two-sided t-test. The fold over empty vector value of the proband allele was normalized to that of the sibling allele as shown in
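The normalization arithmetic described above can be sketched as follows: the firefly-to-NanoLuc ratio of a construct is divided by the same ratio for the empty vector, and the proband allele's fold value is then expressed relative to the sibling allele's. The function names and the example readings are hypothetical; replicate handling and the t-test are omitted.

```python
def fold_over_empty(firefly, nanoluc, empty_firefly, empty_nanoluc):
    """Firefly/NanoLuc ratio of a construct, normalized to the empty vector."""
    return (firefly / nanoluc) / (empty_firefly / empty_nanoluc)


def proband_vs_sibling(proband_fold, sibling_fold):
    """Proband allele activity relative to the sibling allele."""
    return proband_fold / sibling_fold


# Hypothetical luminescence readings for one replicate.
proband = fold_over_empty(800.0, 200.0, 100.0, 100.0)   # 4.0
sibling = fold_over_empty(400.0, 200.0, 100.0, 100.0)   # 2.0
print(proband_vs_sibling(proband, sibling))             # 2.0
```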
Among the genes with mutations demonstrating strong differential activity, NEUROG1 is an important regulator of the initiation of neuronal differentiation and had significant network neighborhood proband excess in the NDEA analysis (p=8.5×10⁻⁴), and DLGAP2 is a guanylate kinase localized to the post-synaptic density in neurons. Mutations near HES1 and FEZF1 also carried a significant differential effect on activator activities: neurogenin, HES, and FEZF family transcription factors act in concert during development, both receiving and sending inputs to Wnt and Notch signaling in the developing central nervous system and, interestingly, the gut, to control stem cell fate decisions; and Wnt and Notch pathways have been previously associated with autism. SDC2 is a synaptic syndecan protein involved in dendritic spine formation and synaptic maturation, and a structural variant near the 3′ end of the gene was reported in an autistic individual. Thus, the method described herein identified alleles of high predicted impact that do indeed show changes in transcriptional regulatory activity in cells. Since many autism genes are under strong evolutionary selection, only effects exerted through (more subtle) gene expression changes may be observable because complete loss of function mutations may be lethal. This implies that further study of the prioritized noncoding regulatory mutations should yield insights into the range of dysregulations associated with autism.
In addition, as a case study for prioritized RRD mutations, the effect of an ASD proband de novo noncoding mutation lying outside of a canonical splice site that was predicted to disrupt splicing of SMEK1 was experimentally validated (ExAC pLI=1.0;
For this mutation, a >40% reduction in the inclusion of the exon for the ASD proband allele compared to the sibling allele was observed in a minigene assay, which is in agreement with the high predicted RRD impact. This demonstrates the highly disruptive biochemical impact a non-splice site de novo mutation can have on RNA splicing.
The minigene assay was performed by first constructing the SMEK1 minigene by amplifying the genomic region with primers: upstream exon + ~1,400 nt intron (TGTGTGGAGCACCATACCTACCA/CCACACTTGAACAAAACTCTATTGTCAAC) (Seq. ID Nos. 3 and 4) and alternative exon, downstream exon + ~1,400 nt intron (GGTAGGACACAAGTCTCCACAAAGC/GGCAGAGTTCATCAGATTGTAGCG) (Seq. ID Nos. 5 and 6). The product was then cloned into the pSG5 vector. Minigene (2 μg) was transfected into SH-SY5Y cells. Cells were harvested 48 h post-transfection for immunoblotting or RT-qPCR following standard protocols. Three independent experiments were performed for statistical comparison.
Case Study: Association of IQ with De Novo Noncoding Mutation in ASD Individuals
De novo noncoding mutations provide a vast space for exploration of phenotype heterogeneity in ASD. To illustrate the potential of such analyses, a case study focused on IQ was performed. Intellectual disability is estimated to impact 40-60% of autistic children, and ASD individuals can also over-inherit common variants associated with high educational attainment. The genetic basis of this variation is not well understood. Despite the genetic complexity observed in association with ASD proband IQ, past efforts to identify mutations that contribute to ASD found that these mutations are also negatively correlated with IQ. Specifically, in analyses of exome sequencing data from different ASD cohorts, a significant association was observed between lower IQ and higher burden of de novo coding likely-gene-disrupting (LGD) (see
A pathogenic role of RBP dysregulation in ASD and other complex disorders has been proposed based on observations of deleterious mutations present within coding sequences of genes encoding RBPs. However, little is known with regard to the downstream role that variants along an RNA sequence might play in disrupting RBP-RNA interactions, especially for rare and de novo mutations, primarily due to the difficulty in interpreting the functional impact of RNA dysregulation at scale. To approach this problem, a new machine learning framework, Seqweaver, was developed that incorporates a collection of in vivo mapped RBP binding maps and couples these data with a deep learning algorithm to predict noncoding variant effects on RBP-RNA interaction. The resulting methodology enabled investigation into the impact of noncoding de novo mutations at single nucleotide resolution simultaneously on hundreds of RBPs in a case-control ASD cohort of 2,075 whole genomes. Using Seqweaver, a previously undiscovered excess burden of noncoding de novo RRD mutations among ASD probands compared to their unaffected siblings (a control set providing the critical matching backgrounds) was found, impacting a large collection of RBPs and target transcripts involved in numerous brain developmental processes. As further evidence of a causal role in ASD etiology, it was found that high impact noncoding RRD mutations are associated with the severity of specific phenotypes observed within ASD children, supporting the value of noncoding variants in clinical applications.
Noncoding nucleotide substitutions comprise the largest fraction of autism de novo variants; however, prioritizing clinically relevant variants in noncoding sequences, including those that disrupt RBP binding, has been challenging, especially at single nucleotide resolution. Modeling RBP binding sites is difficult due to their short degenerate motifs, so a deep learning-based method, Seqweaver, was developed and trained on precise biochemical profiles of RBP-RNA interactions. This training set was used to generate a quantitative model to estimate the binding of RBPs from RNA sequence features alone. Seqweaver leverages a deep convolutional network to integrate evidence beyond a single motif and include surrounding sequence features located up to 500 nucleotides (nt) away. This allows it to take into account features such as potential binding sites of multiple trans-acting factors and locations of splice sites (
To build a sequence feature model for each RBP, Seqweaver was trained using in vivo RBP binding profiles mapped using cross-linking immunoprecipitation (CLIP) from a large set of previously published and newly available Encyclopedia of DNA Elements (ENCODE) datasets (
A systematic evaluation of Seqweaver's ability to predict variant effects on RBP binding was conducted by leveraging allelic imbalance occurring at single nucleotide polymorphisms (SNPs) observed in the human population. When a heterozygous SNP overlaps an RBP binding site, the RBP binding preference of the RNA transcribed from the two alleles can be measured by the allelic imbalance of the observed CLIP sequenced reads. A non-disruptive SNP should generate a comparable number of RNA CLIP reads from each SNP allele, while a high impact SNP would cause an imbalance in RNA CLIP reads. To generate these evaluation SNPs, the initial analysis was conservatively restricted to heterozygous 1000 Genomes Project variants for which the genotype of each allele could be observed independently in both CLIP and RNA-seq data from the same sample cells or individual (34,781 allelically imbalanced SNPs in total).
Using these SNPs as an evaluation set, Seqweaver was able to accurately predict the allele with greater RBP affinity, and did so with increasing accuracy as the threshold was increased for the predicted binding difference between the two alleles (
Seqweaver was tested to see if it could accurately predict the variant effect in the human brain, an important task due to the major role neuronal cells are believed to play in determining autism pathogenicity. In previous work, the in vivo neuronal ELAVL (nELAVL) RBP binding sites in the human prefrontal cortex were mapped by conducting nELAVL-CLIP in 17 postmortem individuals, with the same samples also subjected to RNA-Seq. Using this data, a total of 1,725 1000 Genomes Project SNPs were identified that overlapped with nELAVL binding profiles in human neuronal cells in vivo. Neuronal RBPs and RNA processing are highly conserved; thus it was hypothesized that Seqweaver trained on mouse nElavl profiles should be able to predict the higher affinity human allele despite being trained on mouse sequence data. The nElavl-CLIP method was performed in adult mouse cortex (3 biological replicates,
Furthermore, Seqweaver predicted the effect on RBP binding interactions for the human genetic variation captured by the 1000 Genomes Project, comprising all SNPs in noncoding exonic regions or introns flanking exons (up to 500 nt, total of 5,504,053 SNPs). SNPs predicted by Seqweaver to be RRD variants were also more likely to be under purifying selection based on their lower minor allele frequency (MAF, compared to regional background) and therefore more likely to be deleterious (
The burden of RBP dysregulation in autism was investigated by applying Seqweaver to de novo variants called from whole genome sequencing (WGS) in a cohort of 2,075 individuals in total from the Simons Simplex Collection (SSC). These individuals include 528 ASD probands, 487 unaffected siblings and unaffected parents. Because only one member of each of these simplex families was diagnosed with autism, the relative contribution of de novo mutations in probands is likely to be high. Previously, whole exome sequencing (WES) on SSC families was used to identify an association between coding de novo likely-gene-disrupting (LGD) mutations and autism pathogenicity. To date, efforts to identify noncoding variant categories linked to ASD pathogenesis have been very limited. Indeed, the number of de novo variants per proband in gene regions and small windows surrounding exons showed no significant difference compared to the unaffected siblings when used as control (
Indeed, the proband burden of large effect RRD mutations in noncoding genic regions was significantly larger than the sibling burden (one-sided Wilcoxon rank-sum test p-value=0.02,
Previous reports in autism, schizophrenia and developmental disorders have presented findings of the clustering of rare disruptive coding variants in a collection of genes that are under high purifying selection. It was tested whether highly constrained genes were also enriched for large effect noncoding de novo RRD mutations. Using constrained genes, as defined by the Exome Aggregation Consortium (ExAC), a greater enrichment signature was observed with increasing constraint stringency (
Because fragile X mental retardation protein (FMRP) has been found to be disrupted in ~2% of ASD patients and is the most common monogenic cause of ASD, the targets of FMRP were examined. It was previously demonstrated that FMRP regulates translation of a network of brain mRNAs by stalling ribosome elongation. These FMRP mRNA targets have subsequently been found to be encoded by one of the most highly enriched sets of genetically linked loci in both autism and schizophrenia studies. It was found that the biochemically identified FMRP targets have significant overlap with the most highly constrained genes in ExAC (682 of the 1,498 target genes overlap with the 2,130 genes having ExAC pLI>0.98, hypergeometric p-value<1×10^−14). In concert with previous ASD studies examining coding regions, it was further found that FMRP targets showed strong proband enrichment for noncoding RRD mutations disrupting numerous RBPs in exon-flanking regions and this enrichment was highest surrounding AS exons (
The etiology of fragile X syndrome (FXS) demonstrates the importance of precise stoichiometry and dosage control for the collection of FMRP targets in the brain. Consequently, it was reasoned that FMRP targets might be subjected to an additional layer of regulation during RNA processing (i.e., upstream of translation) and therefore constitute hotspots for ASD RBP dysregulation. It was tested whether, for any RBP, the enrichment of high impact proband RRD mutations compared to siblings was more likely to occur in FMRP targets than in the background constrained genes. Interestingly, two spliceosome associated RBPs, EFTUD2 and SF3B4, were found to have the largest differential burden among FMRP targets (differential burden enrichment for both factors p-value<0.05, permutation test; FMRP targets proband RRD enrichment EFTUD2 p-value=2.2×10^−4, SF3B4 p-value=7.6×10^−4, one-sided Wilcoxon rank-sum test,
An enrichment analysis was conducted to identify cellular functions and pathways that show an excess burden of high impact RRD mutations (
One of the hallmarks of autism is altered brain development, and a major focus of research has been to understand embryonic or early postnatal development in autism. The noncoding RRD mutations discovered were used together with gene expression RNA-seq data of the developing human brain to conduct an unbiased investigation into the temporal window of autism pathogenicity. For each RNA-seq dataset from an unaffected human brain specimen (prefrontal cortex), an autism risk signature was calculated by testing the up-regulation of expression for genes harboring a proband RRD mutation compared to the control set of mutated genes from siblings. Our analysis (
The clustering of noncoding RRD mutations in connection to gender disparity observed in ASD was also examined. The occurrence of autism is ~5 times higher among males than females. Previous genetic studies have suggested that females may possess protection against ASD risk variants. When comparing the predicted effects of RRD mutations among constrained genes, the female probands exhibited a significantly higher enrichment of large effect RRD mutations compared to both male probands (p-value=0.041,
Noncoding Mutations are Associated with Clinical Phenotype in ASD
Large collections of studies examining ASD cohorts have identified substantial heterogeneity in their clinical phenotypes. Thus, RBP dysregulation association with clinical diversity among the probands was investigated. Altered social interaction and repetitive or stereotyped behavior are the key clinical indications for diagnosing autism spectrum disorder. Among constrained genes, it was found that probands with high impact noncoding RRD mutations displayed a greater alteration in both social interaction (ADI-R social total, p-value=0.01, Pearson product-moment correlation coefficient test for all) and behavior (ADI-R behavior total, p-value=0.049) (
Intellectual disability is estimated to impact 40-60% of autistic children. Accordingly, non-verbal IQ has previously been associated with the ascertainment of de novo coding LGD mutations. Similar to LGD mutations, a significant correlation between non-verbal IQ and the predicted effect of noncoding RRD mutations was observed (p-value=0.02). Among individual RBP models, probands harboring RRD mutations for the RBPs TDP-43, MBNL and RBFOX showed the greatest association with non-verbal IQ (
A heterogeneous aspect of phenotypic outcome in autistic children is verbal communication. Specifically, verbal regression is characterized by the loss of word and communication skills after the first few years. Unlike IQ, the existence of a genetic link and the subsequent molecular basis of this phenotype has been uncertain. The de novo mutations within constrained genes were segregated into two groups based on the proband's verbal regression phenotype (word loss or no loss of verbal communication). After de novo mutations were stratified by proband phenotype, a statistically significant association between verbal regression and the predicted effect of noncoding RRD mutations was observed (p-value=0.021,
A machine learning approach of deep convolutional neural networks (ConvNet) was utilized to build a quantitative model of the RNA sequence features required for the binding of each RBP. ConvNets allow researchers to design network architectures that can leverage information from high order motifs at different spatial scales but with optimal parameter sharing to avoid overfitting. The ConvNet architecture consists of an initial input layer followed by a series of convolution and pooling layers. The input layer contains a 4×1,000 matrix that encodes the input RNA sequence of U, A, G, C across the 1,000 nt window anchored around the RBP binding site. The subsequent convolution layer looks at 8 nt at a time, shifting by 1 nt, and computes the convolution operation of 160 kernels. At this first convolution level, the kernels are equivalent to searching for a collection of local sequence motifs in a one-dimensional RNA sequence. Analogous to neurons, a rectifier activation function (ReLU) was then applied, which sets the convolution layer output to a minimum of 0 (i.e., ReLU(x)=max(0,x)). Thus formally, input S results in convolution layer output location n for kernel k as the following:
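A form consistent with the definitions that follow is sketched here. The kernel weight symbol $W^{k}$ and the zero-based offsets are assumptions of this sketch, since the text names only S, n, k, I and J:

```latex
\mathrm{convolution}(S)_{n,k} = \mathrm{ReLU}\!\left( \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} W^{k}_{i,j}\, S_{n+i,\,j} \right)
```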
where I is the window size and J is the input depth (e.g., for the first convolution layer I corresponds to the local sequence motif length and J represents the four RNA bases).
Next, a pooling layer that reduces the dimensional size of the network and the number of parameters was added. Specifically, every window of 4 positions of a kernel's output is collapsed into the maximum value observed in that span. Subsequently, the resulting output is used as input for a sequence of convolution (2nd), ReLU, pooling and convolution (3rd) layers, in which higher order sequence motifs can be derived based on the first layer local motifs (2nd conv. layer 320 kernels, 3rd conv. layer 480 kernels, with identical ReLU and pooling layers).
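The one-hot input, first convolution with ReLU, and max pooling steps described above can be illustrated with a minimal pure-Python sketch. It is not the Seqweaver implementation: the row order of the bases, the absence of padding, the non-overlapping pooling stride, and all function names are assumptions made for illustration, and the toy kernel is hand-written rather than learned.

```python
BASES = "UAGC"  # row order follows the U, A, G, C listing; an assumption of this sketch


def one_hot(seq):
    """Encode an RNA string as a 4 x len(seq) matrix, one row per base in BASES."""
    seq = seq.upper().replace("T", "U")
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


def conv_relu(x, kernels):
    """Stride-1 convolution without padding, followed by ReLU.

    x is a J x length input; each kernel is a J x I weight matrix.
    Returns one row per kernel of length (length - I + 1).
    """
    length = len(x[0])
    out = []
    for w in kernels:
        width = len(w[0])
        row = []
        for n in range(length - width + 1):
            total = sum(w[j][i] * x[j][n + i]
                        for j in range(len(w)) for i in range(width))
            row.append(max(0.0, total))  # ReLU(x) = max(0, x)
        out.append(row)
    return out


def max_pool(rows, size=4):
    """Collapse each non-overlapping window of `size` positions to its maximum."""
    return [[max(r[s:s + size]) for s in range(0, len(r) - size + 1, size)]
            for r in rows]


# Toy example: one width-2 kernel that responds to the dinucleotide "AG".
ag_kernel = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
features = conv_relu(one_hot("CAGU"), [ag_kernel])
pooled = max_pool(features, size=2)
```

In the architecture described, the first layer would use 160 kernels of width 8 over a 1,000 nt window and a pooling window of 4; the toy sizes here only keep the example readable.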
Finally, a fully connected layer (size human 217, mouse 43) that can now take the resulting output from the three convolution steps to integrate across the entire 1,000 nt context was added to derive a final set of high order sequence motifs. These high order sequence motifs are shared across all RBP models that allow optimal parameter reduction, but also are based on the biological intuition that many RNA sequence features are shared in the cell (e.g., splice sites and branchpoints). The fully connected layer outputs (i.e., high order sequence features) are then subjected to RBP-specific weighted logistic functions (sigmoid, [0,1] scale) allowing for the simultaneous prediction of each RBP binding propensity to the input RNA sequence.
Training of all ConvNet parameters was conducted using primarily a CLIP-derived training set to minimize the following loss function:
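A loss consistent with the symbol definitions that follow is sketched here as a summed binary cross-entropy with an L2 penalty. The weight symbol W, the index layout of the labels, and the exact placement of λ1 are assumptions of this sketch:

```latex
\mathrm{Loss} = -\sum_{i}\sum_{j}\left[ L_{j}^{i}\,\log f_{j}(S^{i}) + \left(1 - L_{j}^{i}\right)\log\left(1 - f_{j}(S^{i})\right) \right] + \lambda_{1}\,\lVert W \rVert_{2}^{2}
```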
Here, i indicates the training examples and j indicates the RBP features. Lji is the training label (0 or 1) for example i and RBP feature j. fj(Si) represents the ConvNet predicted probability of RNA sequence Si being a binding site for RBP j. For regularization, L2 regularization (λ1) was used for all weight matrix values, and random dropout of outputs following each convolution-pooling series was applied. The loss function was optimized using stochastic gradient descent. A full list of the parameters used in the model is provided below:
Layer 2: 10%
Layer 4: 10%
Layer 5: 30%
All other layers: 0%
231 CLIP binding profiles for 82 unique RBPs and a branchpoint mapping profile were used as input features. In addition, 28 annotated splice site (3′ and 5′) features were included as experimental features, but were not included in the subsequent ASD variant impact analysis. ENCODE processed CLIP data was downloaded for uniform peak calling together with non-ENCODE data. All gene regions defined by Ensembl (mouse build 80, human build 75) were split into 50-nt bins. All bins that overlap repeat regions were removed (RepeatMasker). For each bin, RBP features that overlapped more than half of the bin were assigned a corresponding positive label. Negative labels were assigned to bins with at least one RBP peak (excluding the RBP of training). CLIP peaks from chromosomes 4, 9, 13 and 16 were used for evaluation of the input sequence context window. Seqweaver code and input data are available at seqweaver.princeton.edu.
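The binning and labeling rule described above can be sketched as follows. This is an illustrative sketch rather than the pipeline actually used: intervals are assumed to be half-open, peaks of a single RBP are assumed not to overlap each other, the same more-than-half rule is assumed when deciding whether another RBP has a peak in the bin, and the function names are hypothetical.

```python
def make_bins(start, end, size=50):
    """Split the half-open interval [start, end) into consecutive bins of `size` nt."""
    return [(s, min(s + size, end)) for s in range(start, end, size)]


def overlap(a, b):
    """Number of positions shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def label_bins(bins, peaks_by_rbp, size=50):
    """Return {rbp: labels}, with 1 = positive, 0 = negative, None = unused bin."""
    labels = {}
    for rbp in peaks_by_rbp:
        row = []
        for b in bins:
            covered = {name: sum(overlap(b, p) for p in peaks)
                       for name, peaks in peaks_by_rbp.items()}
            if covered[rbp] > size / 2:
                row.append(1)      # this RBP's peaks cover more than half the bin
            elif any(c > size / 2 for name, c in covered.items() if name != rbp):
                row.append(0)      # bound by some other RBP only (assumed rule)
            else:
                row.append(None)   # no RBP peak: not used as a training example
        labels[rbp] = row
    return labels


bins = make_bins(0, 150)
labels = label_bins(bins, {"RBP_A": [(0, 30)], "RBP_B": [(60, 100)]})
```

Drawing negatives only from bins bound by some other RBP keeps the negative set within expressed, CLIP-accessible sequence rather than arbitrary genomic background.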
The Genome Analysis Toolkit (GATK) was used, following GATK best practice guidelines for RNA-Seq based genotyping, to genotype the biological samples (17 postmortem human prefrontal cortex specimens, HeLa, 293T, and the ENCODE tier 1 cell lines HepG2 and K562). All raw sequencing files were aligned to the genome using the STAR aligner (2.4), followed by HaplotypeCaller (RNA-seq mode) to call variants. To reduce false positive calls, only heterozygous 1000 Genomes Project SNPs were used for subsequent analysis. As an additional filter for both accurate variant calling and quantification of allele-specific reads, the WASP methodology, which utilizes a post-processing remapping strategy for all reads with the alternative allele to reduce any biases, was applied. Any SNP that, following WASP post-processing (i.e., remapping test of alt. allele reads), had a MAF of 0.01 or less (ratio of RNA-seq reads derived from the minor allele) or a read coverage of 10 or less was removed from the pool of SNPs for each sample.
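As a small illustration of the final read-level filter, the sketch below keeps a SNP only when both thresholds are met. It assumes that per-allele RNA-seq read counts are already available after remapping, and that a SNP failing either threshold is removed; the function name is hypothetical.

```python
def passes_rnaseq_filter(ref_reads, alt_reads, min_maf=0.01, min_coverage=10):
    """Keep a SNP only if coverage > min_coverage and minor-allele read ratio > min_maf."""
    coverage = ref_reads + alt_reads
    if coverage <= min_coverage:
        return False
    maf = min(ref_reads, alt_reads) / coverage  # fraction of reads from the minor allele
    return maf > min_maf
```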
Next, the sample specific SNPs were overlaid onto the alignment files from CLIP experiments of the same corresponding sample type (102 RBP-sample type combinations in total) using the GATK ASEReadCounter tool. Analogous to RNA-Seq, the WASP method was applied to the CLIP derived reads to produce the final CLIP observed genotype and allele-specific read count for each sample. Conservatively, only SNPs that had the same observed genotype from both RNA-Seq and CLIP were used, despite the loss of the most impactful SNPs that lead to complete loss of RBP binding. Additionally, only 1000 Genomes Project SNPs were used, excluding any indels, which are more challenging to genotype and might also result from the UV cross-linking process during a CLIP experiment (compared to indels, substitutions do not show locational enrichment within RBP CLIP reads). Finally, only SNPs with a log2 odds ratio of CLIP vs RNA-seq allelic ratio of >0.5 or <−0.5 were labeled as either reference-biased or alternative-biased SNPs (defined based on odds ratio; 34,781 unique SNPs with observed allelic imbalance in total, Additional Data table S2). All SNPs discovered from each human brain specimen (paired RNA-seq+nELAVL-CLIP) were pooled into one final evaluation set, which resulted in a roughly equal ratio of allele biased variants (1.1 ratio of ref. vs alt. biased SNPs; 1,725 SNPs in total).
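The odds-ratio labeling step can be sketched as below. This is a minimal illustration under stated assumptions: the ratio is taken as reference over alternative reads so that positive values mean reference-biased binding, and the optional pseudocount (absent from the text) is included only to avoid division by zero. The function names are hypothetical.

```python
import math


def log2_odds_ratio(clip_ref, clip_alt, rna_ref, rna_alt, pseudocount=0.0):
    """log2 of (CLIP ref/alt ratio) divided by (RNA-seq ref/alt ratio)."""
    clip_ratio = (clip_ref + pseudocount) / (clip_alt + pseudocount)
    rna_ratio = (rna_ref + pseudocount) / (rna_alt + pseudocount)
    return math.log2(clip_ratio / rna_ratio)


def classify_snp(clip_ref, clip_alt, rna_ref, rna_alt, threshold=0.5, pseudocount=0.0):
    """Label a SNP by the direction of its CLIP allelic imbalance, or None if balanced."""
    lor = log2_odds_ratio(clip_ref, clip_alt, rna_ref, rna_alt, pseudocount)
    if lor > threshold:
        return "reference-biased"
    if lor < -threshold:
        return "alternative-biased"
    return None
```

Normalizing the CLIP allelic ratio by the RNA-seq allelic ratio separates a binding preference from an allele-specific difference in transcript abundance.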
Three biological replicates of adult C57BL/6J mice were used to conduct cortex Elavl-CLIP. Elavl was immunoprecipitated from UV cross-linked cortex samples using an anti-Hu serum that recognizes all three neuronal Elavl isoforms.
Genotyping SSC Families from Whole Genome Sequencing
The Simons Foundation Autism Research Initiative (SFARI) WGS data phase 1 release, which includes raw data and WGS genotyping according to a previous SSC report, was used in this study. Candidate SNVs were further filtered by DNMFilter to identify de novo mutations in probands and siblings with a probability threshold of >0.75. The de novo mutations were further isolated by removing any overlap with the 1000 Genomes Project SNVs. In addition, all SNVs located within low complexity regions (RepeatMasker) were removed. Using GENCODE gene annotations (build 25), the final number of de novo SNVs located in gene regions was 9,040 for probands and 8,304 for unaffected siblings.
To make the variant effects across RBP models more comparable within the ASD context, an RBP model specific modified e-value and a p-value were first assigned to each de novo variant. The modified e-value is calculated by merging all proband and sibling de novo variants from the category of interest (e.g., AS exons in FMRP targets) into one pool and assigning the following,
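$\Pr(X_{\mathrm{pos},i} \ge x_{\mathrm{pos},i} \mid \forall V_{\mathrm{pos}})_i$ or $\Pr(X_{\mathrm{neg},i} \le x_{\mathrm{neg},i} \mid \forall V_{\mathrm{neg}})_i$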
where i is the RBP model, x is the variant margin (i.e., the predicted difference in binding probability for RBP i between the reference allele and the alternative allele) and V is the set of all de novo variants in the query category. The −log10 margin was modeled as a normal distribution separately for positive and negative margin variants (i.e., predicted gain or loss of binding) but without distinction of proband and sibling origin. The modified e-value provides a measurement of the rarity of a variant's predicted effect with equal treatment of proband and sibling variants, and is thus ideal when assessing the differential burden between the two groups. P-values were assigned using the same procedure, except that the null distribution was modeled using only the −log10 margins of sibling variants. A combined score of maximum variant effect on RBP binding was calculated by assigning the minimum e-value across all RBP models to the variant. Finally, z scores were derived by converting the minimum e-values of all variants within the query category into a standard normal distribution (inverse of the normal CDF function applied to 1 − e-value), then computing the z score for each variant.
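The e-value and z score procedure can be sketched with the Python standard library as below. Several details are assumptions rather than statements from the text: the tail probability is taken on the −log10 scale (so a larger absolute margin gives a smaller e-value), zero margins keep an e-value of 1.0, probabilities are clamped away from 0 and 1 before the inverse CDF, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, pstdev


def modified_e_values(margins):
    """E-values for one RBP model from signed margins pooled over probands and siblings."""
    e_values = [1.0] * len(margins)  # zero margins keep e-value 1.0 (assumption)
    for sign in (1, -1):             # fit gain and loss of binding separately
        idx = [k for k, m in enumerate(margins) if m * sign > 0]
        if len(idx) < 2:
            continue                 # too few variants to fit a normal distribution
        y = [-math.log10(abs(margins[k])) for k in idx]
        dist = NormalDist.from_samples(y)
        if dist.stdev == 0:
            continue
        for k, value in zip(idx, y):
            # A larger |margin| gives a smaller -log10 value, so the lower tail
            # measures how rare an effect at least this large is in the pool.
            e_values[k] = dist.cdf(value)
    return e_values


def combined_z_scores(e_values_by_model, eps=1e-12):
    """Minimum e-value per variant across models, mapped to z scores."""
    min_e = [min(column) for column in zip(*e_values_by_model)]
    quantiles = [NormalDist().inv_cdf(min(max(1.0 - e, eps), 1.0 - eps))
                 for e in min_e]
    mu, sd = mean(quantiles), pstdev(quantiles)
    if sd == 0:
        return [0.0] * len(quantiles)
    return [(q - mu) / sd for q in quantiles]
```

For the sibling-based p-values, the same fit would be made on sibling margins only and then evaluated for every variant.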
Human exons that are alternatively spliced were obtained from a recent study that examined publicly available human RNA-seq data to annotate an extensive catalog of AS events. The internal exon region was used for the alternative exon definition types of cassette, mutually exclusive, and tandem cassette exons. The terminal exon region was used for the intron retention and alternative 3′ or 5′ exon AS exon types. All exon-flanking regions, allowing intervals to span across exons, were collapsed into a final set of genomic intervals used to subset SNVs. SNVs were allowed to overlap noncoding exon regions if the flanking regions overlapped a UTR segment of the gene.
The most recently updated list of autism coding de novo LGD genes was obtained from Krishnan et al. {Krishnan:2016da}, and release 1.0 of the ExAC functional gene constraint scores was used to obtain pLI (probability of loss-of-function intolerance). An extended list of FMRP targets was used, derived from 3 additional biological replicates together with the original 7 FMRP-CLIP replicates {Darnell:2011cy} (1,498 genes, manuscript in preparation, gene list and additional replicate data available upon request prior to publication). Transcripts with FDR<0.05 and coverage in at least 6 biological replicates were defined as FMRP targets, and mouse genes were mapped to human genes that satisfy the ENSEMBL defined 1-to-1 or 1-to-many orthologues (i.e., expansion in human lineage) for subsequent analyses.
The differential enrichment of large effect RRD mutations for EFTUD2 and SF3B4 within FMRP targets compared to the background constrained genes (non-targets) was computed by using the difference in t-statistics (predicted effect of proband vs sibling) of the two gene sets as a test statistic. A null distribution was computed by permuting the FMRP target membership label for the collection of de novo mutations within constrained genes for 1,000 iterations. The top 1,000 CLIP peaks for EFTUD2 and SF3B4 (ENCODE CLIP HepG2) were used to conduct motif analysis using the MEME Suite {Bailey:2009eu} (MEME and CentriMo) to find significantly enriched sequence elements. Nucleotide level enrichment of motifs was conducted by first searching for each instance of the motif using the MEME tool FIMO within 200 nt upstream and downstream of AS exons within the gene set. The final enrichment score E was computed as follows,
where i is the nt to compute enrichment, mi is the total number of exons with FIMO motif hits overlapping nt location i and Si,j is the FIMO score at nt i in exon j. N is the total number of AS exons examined.
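The permutation test on FMRP target membership described earlier in this passage (difference in proband-vs-sibling t-statistics between targets and non-targets, with target labels permuted) can be sketched as follows. The choice of a Welch t-statistic, the handling of undefined statistics, the add-one p-value estimate, and the function names are all assumptions of this sketch; the text does not specify them.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic for mean(a) - mean(b); returns 0.0 when it is undefined."""
    if len(a) < 2 or len(b) < 2:
        return 0.0
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def differential_burden(effects, is_proband, is_target, n_perm=1000, seed=0):
    """Observed difference in t-statistics and a one-sided permutation p-value."""
    def statistic(target_flags):
        groups = {(p, t): [] for p in (True, False) for t in (True, False)}
        for e, p, t in zip(effects, is_proband, target_flags):
            groups[(bool(p), bool(t))].append(e)
        in_targets = welch_t(groups[(True, True)], groups[(False, True)])
        background = welch_t(groups[(True, False)], groups[(False, False)])
        return in_targets - background

    observed = statistic(is_target)
    rng = random.Random(seed)
    flags = list(is_target)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(flags)  # permute target membership, keep proband labels fixed
        if statistic(flags) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Permuting only the target membership preserves the proband and sibling composition of the mutation pool, so the null distribution reflects how often an arbitrary gene set of the same size shows as large a differential burden.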
Each GO term test statistic was computed as follows. First, proband and sibling de novo mutations that are located within the GO term annotated genes were isolated (400 nt flanking exon regions). Next, each RBP model was tested for increased RBP dysregulation (one-sided Wilcoxon rank-sum test of the predicted effects of proband vs. sibling) for the GO term gene set specific de novo mutations. The summation of the −log10(p-value) of all RBP models was used as the GO term test statistic for the ASD burden of RRD mutations. The GO term test statistic was converted to an enrichment p-value by generating a null distribution with 1,000 iterations of permuting the proband/sibling labels for the de novo mutations and repeating the same procedure to obtain the null test statistic (from random proband/sib labels). Finally, GO terms with p-value<0.05 and FDR<0.1 were reported as enriched for proband RRD mutations. Local FDR was computed using the q-value package. GO term annotations were pooled from human (EBI May 9, 2017), mouse (MGI May 26, 2017) and rat (RGD Apr. 8, 2017), and terms with an annotation size of less than 150 or greater than 3,000 genes were removed. Query GO terms were obtained from the merged set of curated GO consortium slims from Generic, Protein Information Resource (PIR), Synapse, and ChEMBL, supplemented by PANTHER GO-slim and terms from NIGO.
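The GO term statistic and its label-permutation p-value can be sketched as below. The rank-sum p-value here uses a normal approximation with average ranks for ties and no tie or continuity correction, which is a simplification swapped in for brevity and not necessarily the test implementation used. The p-value floor, the add-one estimate, and the function names are also assumptions.

```python
import math
import random
from statistics import NormalDist


def ranksum_greater_p(x, y):
    """One-sided rank-sum p-value that x tends to exceed y (normal approximation)."""
    values = sorted(x + y)
    rank_of = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    n1, n2 = len(x), len(y)
    w = sum(rank_of[v] for v in x)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return max(1.0 - NormalDist().cdf((w - mu) / sigma), 1e-300)  # floor avoids log(0)


def go_term_statistic(effects_by_rbp, is_proband):
    """Sum over RBP models of -log10(p) for proband vs. sibling predicted effects."""
    total = 0.0
    for effects in effects_by_rbp.values():
        pro = [e for e, p in zip(effects, is_proband) if p]
        sib = [e for e, p in zip(effects, is_proband) if not p]
        total += -math.log10(ranksum_greater_p(pro, sib))
    return total


def go_term_p_value(effects_by_rbp, is_proband, n_perm=1000, seed=0):
    """Permute proband/sibling labels to turn the statistic into a p-value."""
    observed = go_term_statistic(effects_by_rbp, is_proband)
    rng = random.Random(seed)
    labels = list(is_proband)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if go_term_statistic(effects_by_rbp, labels) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Because the same mutations are scored by every RBP model, the per-model p-values are correlated; permuting the labels once per iteration and recomputing the whole sum keeps that correlation in the null distribution.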
Unaffected human brain (i.e., non-ASD, prefrontal cortex) developmental stage RNA-seq data was used to examine the autism risk signature. For each RNA-Seq biological replicate, gene level abundance was estimated by aligning reads with the STAR aligner and estimating the TPM values with RSEM. Genes harboring a proband de novo mutation in 400 nt exon-flanking regions were segregated based on the predicted effect (all, z score>1 or z score<−1), and a differential expression statistic was calculated by comparison with the expression level of sibling-mutated genes (one-sided Wilcoxon rank-sum test). The level of up-regulation of expression for the proband RRD mutation-harboring genes compared to control (sibling mutated genes) was used as a measure of the autism risk signature for the developmental time point.
All proband phenotype information was obtained from the Simons Foundation core descriptive variables (version 15, which provides summary statistics for each proband's clinical phenotypes). The scores were derived from the Autism Diagnostic Interview-Revised (ADI-R) algorithm as described in the SSC phenotype descriptions. The social interaction severity measurement was obtained from the “adi_r_soc_a_total” metric, which is the total score for the Reciprocal Social Interaction Domain on the ADI-R algorithm. The behavior severity measurement, the “adi_r_rrb_c_total” metric, is the total score for the Restricted, Repetitive, and Stereotyped Patterns of Behavior Domain. The “regression” phenotype distinction was made, according to the SSC core description, from loss items on the ADI-R loss insert or questions. Verbal communication severity was obtained from the “adi_r_b_comm_verbal_total” metric, which provides the total score for the Verbal Communication Domain on the ADI-R. The severity of phenotypes was tested for a positive association with de novo variant predicted effects within constrained genes (ExAC pLI>0.95; consistent significant results, p-value<0.05 for each category, were also observed for ExAC pLI>0.98). The R implementation of the Pearson product-moment correlation coefficient test was used for all.
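A test for positive association between predicted effect and phenotype severity can be sketched as below. The text reports using the R implementation of the Pearson test, which is based on a t distribution; this sketch swaps in the Fisher z-transformation with a normal approximation so that it needs only the Python standard library, and it assumes a one-sided alternative. Function names are hypothetical.

```python
import math
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def positive_association(effects, severity):
    """Return (r, approximate one-sided p-value for r > 0) via the Fisher z-transform."""
    n = len(effects)
    if n < 4:
        raise ValueError("the Fisher z approximation needs at least 4 observations")
    r = pearson_r(effects, severity)
    r_clamped = max(min(r, 1.0 - 1e-12), -1.0 + 1e-12)  # keep atanh finite
    z = math.atanh(r_clamped) * math.sqrt(n - 3)
    return r, 1.0 - NormalDist().cdf(z)
```

For the cohort sizes described, the normal approximation should be close to the t-based test, but p-values from this sketch will not match R's output exactly.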
While the above description contains many specific embodiments of the invention, these should not be construed as limitations on the scope of the invention, but rather as an example of one embodiment thereof. Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.
This application claims priority to U.S. Provisional Application Ser. No. 62/622,556 entitled “Methods of Identifying Non-coding Genomic RNA Regulatory Sequences and Sequence Variants and Correlating Them with Phenotypic Variations,” filed Jan. 26, 2018, U.S. Provisional Application Ser. No. 62/622,655 entitled “Methods of Identifying Non-coding Regulatory Genomic Sequences and Sequence Variants and Correlating Them with Phenotypic Variations,” filed Jan. 26, 2018, and U.S. Provisional Application Ser. No. 62/797,926 entitled “Methods for Analyzing Genetic Data to Classify Multifactorial Traits Including Complex Medical Disorders,” filed Jan. 28, 2019, each of which is herein incorporated by reference in its entirety.
This invention was made with Government support under Grants No. HHSN272201000054C, No. HG008901, No. GM071966, No. HL117798, No. HG005998, No. NS034389, and No. NS081706, awarded by the National Institutes of Health. The Government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2019/015484 | 1/28/2019 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62622556 | Jan 2018 | US | |
62622655 | Jan 2018 | US | |
62797926 | Jan 2019 | US |