The invention generally relates to the identification of epigenetic modification and/or epigenetic regulatory regions of DNA that are associated with the transgenerational inheritance of epimutations using a sequential machine learning approach. In particular, the invention provides the sequential application of Active Learning analysis and Imbalance Class Learner analysis to epigenetic datasets.
The current paradigm for the etiology of heritable diseases, including those caused by environmental insult, is based primarily on mechanisms of genetic alterations such as DNA sequence mutations. However, the majority of inherited diseases have not been linked to specific genetic abnormalities or changes in DNA sequence. In addition, the majority of environmental factors known to cause or influence the development of disease—including heritable diseases—do not have the capacity to alter DNA sequence. Therefore, additional molecular mechanisms need to be taken into account when attempting to clarify the etiology of diseases and to develop diagnostic tools and treatments.
Epigenetics is defined as “molecular factors and processes around DNA that regulate genome activity independent of DNA sequence and are mitotically stable” [1]. The molecular factors currently known to be epigenetic processes include DNA methylation, histone modifications, chromatin structure and selected non-coding RNA [1,3-7]. Epigenetics has been shown to be a critical factor in normal biology, disease etiology and evolution [1,8]. A combination of epigenetic and genetic molecular mechanisms will be essential for nearly all biological processes. However, genetics has been the primary molecular component considered for nearly all aspects of biology. For example, DNA sequence and genetics have been considered the primary form of inheritance. More recently, environmentally induced epigenetic transgenerational inheritance has been described in species from plants to humans [1]. This provides an additional epigenetic mechanism for inheritance to consider [9] and helps explain forms of familial inheritance not easily explained with classical genetics.
Epigenetic transgenerational inheritance is defined as “germline transmission of epigenetic information between generations in the absence of direct environmental exposure” [1]. A growing number of environmental factors have been shown to promote the epigenetic transgenerational inheritance of disease and phenotypic variation from nutrition, stress or toxicants [1,10]. The environmental chemicals shown to promote transgenerational inheritance of disease and sperm epimutations include the agricultural fungicide vinclozolin [1], pesticide permethrin and insect repellent N,N-diethyl-meta-toluamide (DEET) [12], pesticides methoxychlor [13] and dichlorodiphenyltrichloroethane (DDT) [14], plastic derived compounds bisphenol A (BPA) and phthalates [15], and hydrocarbon mixtures (jet fuel, JP8) [16]. The F0 generation gestating female rats were transiently exposed during fetal gonadal development and then the F1, F2 and F3 generations generated [1,11]. The transgenerational F3 generation (i.e., no direct exposure) was found to have a large number of high frequency disease states including testis, ovary, prostate, mammary and kidney disease [17].
Analysis of the F3 generation male sperm demonstrated differential DNA methylation regions (DMRs) that were highly reproducible and exposure specific [18,19]. These DMRs were termed epimutations and ranged in number for genome-wide promoter regions from 30 to 300 depending on the specific exposure [13,14,18]. Each transgenerational set of epimutations was found to be exposure specific with negligible overlap between exposures [1,18]. In addition to the transgenerational sperm epimutations, somatic cell transgenerational epimutations for the agricultural vinclozolin lineage F3 generation testicular Sertoli cells and ovarian granulosa cells were utilized in a similar analysis [20,21]. As found with the exposure specific sperm epimutations, the somatic cell epimutation sets were cell specific with negligible overlap. These somatic cell transgenerational epimutation data sets were also used independently in the current study as training sets for machine learning predictions for somatic cells versus germ cells.
These transgenerational epimutations were used to identify common genomic features associated with the epimutations. The first genomic feature found to be associated with all epimutations [18] was a low CpG density of less than 10 CpG per 100 bp; such regions were characterized as “CpG deserts” containing small CpG clusters with differential DNA methylation [22] (see also U.S. Patent Publication 2013/0226468 to Skinner et al., herein incorporated by reference). The second set of genomic features identified were unique DNA sequences generally within a few hundred base pairs of the differential DNA methylation region [23]. These DNA sequence motifs were previously shown to associate with binding proteins that bend DNA [19,23]. In addition to these genomic features, a number of other genomic features previously shown to associate with epigenetic sites were also selected for the analysis [24].
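The CpG density criterion can be illustrated with a short sketch. The function names and the default threshold argument are illustrative assumptions; only the figure of 10 CpG per 100 bp comes from the text, and the referenced studies may compute density over windows defined differently.

```python
def cpg_per_100bp(seq: str) -> float:
    """Count CpG dinucleotides and normalise the count to 100 bp."""
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact number.
    return seq.count("CG") * 100.0 / len(seq)


def is_cpg_desert(seq: str, threshold: float = 10.0) -> bool:
    """True when the density is below the threshold (assumed: 10 per 100 bp)."""
    return cpg_per_100bp(seq) < threshold


# Hypothetical sequences, for illustration only.
sparse = "A" * 95 + "CGCGC"   # 2 CpG in 100 bp
dense = "ACGT" * 25           # 25 CpG in 100 bp
```

Under these assumptions, `sparse` would be flagged as a candidate CpG desert and `dense` would not.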
Despite the various genomic features identified to date, improved genome-wide methods of identifying epigenetic modification and/or epigenetic regulatory regions of DNA that are associated with the transgenerational inheritance of epimutations are urgently needed.
Aspects of the present invention provide a novel machine learning approach to further identify the genomic features of the transgenerational germline epimutations and predict genome-wide sites that may be susceptible to become environmentally modified epimutations.
One aspect of the invention provides a computer-implemented method of identifying potential genomic locations and regulatory sites of epimutations, comprising inputting into a computer at least one genomic DNA sequence; identifying, with said computer, one or more regions of said at least one genomic DNA sequence which comprise one or both of potential locations of epimutations and potential regulatory sites of epimutations by a) training the computer with at least one training set comprising known epimutations to determine a set of potential genomic features associated with the known epimutations; b) using the trained computer to perform Active Learning analysis to identify the optimal genomic features from the set of potential genomic features that allow for the identification of the known epimutations in the training sets; c) using Imbalance Class Learner analysis to correct for data set imbalance; and d) selecting one or more regions in the genomic DNA sequence that contains one or more of the identified optimal genomic features; wherein said one or more regions comprise one or both of potential locations of epimutations and potential regulatory sites of epimutations and wherein said steps b) and c) are performed sequentially or simultaneously.
In some embodiments, steps a)-d) are performed on a server operationally connected to said computer. In some embodiments, the genomic DNA sequence is obtained from a nucleotide sequencing apparatus that is operationally linked to said computer. In other embodiments, the genomic DNA sequence is obtained from a second computer containing a database of genomic DNA sequences. In some embodiments, the computer-implemented method further comprises the step of, with said computer, identifying, within said one or more regions of said at least one genomic DNA sequence, at least one DNA sequence motif that is associated with one or both of epimutations and regulatory sites of epimutations.
Another aspect of the invention provides a system comprising i) a computer; ii) at least one non-transient storage medium comprising computer executable instructions which are performed by said computer and which cause said computer to carry out the steps of a) receiving at least one genomic DNA sequence as input; b) training with at least one training set comprising known epimutations to determine a set of potential genomic features associated with the known epimutations; c) performing Active Learning analysis to identify the optimal genomic features from the set of potential genomic features that allow for the identification of the known epimutations in the training sets; d) using Imbalance Class Learner analysis to correct for data set imbalance; and e) selecting one or more regions in the genomic DNA sequence that contains one or more of the identified optimal genomic features; wherein said steps c) and d) are performed sequentially or simultaneously; and iii) an output device capable of presenting results obtained by said computer in said selecting step.
In some embodiments, the system further comprises a server wherein said computer executable instructions which are performed by said computer cause said computer to carry out steps b) and e) on said server. In some embodiments, the system further comprises a nucleotide sequencing apparatus wherein said at least one non-transient storage medium further comprises instructions for causing said computer to receive said at least one genomic DNA sequence from said nucleotide sequencing apparatus. In some embodiments, the system further comprises a second computer containing a database of genomic DNA sequences wherein said at least one non-transient storage medium further comprises instructions for causing said computer to receive said at least one genomic DNA sequence from said database on the second computer. In some embodiments, the output device is selected from the group consisting of a printer, display, and modem.
Another aspect of the invention provides a method for the early intervention and treatment of a subject who is suspected of or who has been exposed to an environmental agent or who has or is suspected of having a disease or condition of interest, comprising inputting into a computer at least one genomic DNA sequence from said subject and from a positive control; identifying, with said computer, one or more regions of said at least one genomic DNA sequence which comprise one or both of potential locations of epimutations and potential regulatory sites of epimutations by a) training the computer with at least one training set comprising known epimutations to determine a set of potential genomic features associated with the known epimutations; b) using the trained computer to perform Active Learning analysis to identify the optimal genomic features from the set of potential genomic features that allow for the identification of the known epimutations in the training sets; c) using Imbalance Class Learner analysis to correct for data set imbalance; and d) selecting one or more regions in the genomic DNA sequence that contains one or more of the identified optimal genomic features; wherein said one or more regions comprise one or both of potential locations of epimutations and potential regulatory sites of epimutations and wherein said steps b) and c) are performed sequentially or simultaneously; determining the presence or absence of an epigenetic modification within said one or more regions of genomic DNA in said subject and said positive control; comparing the epimutations of said one or more regions of the positive control to the same one or more regions in a genomic DNA sequence of the subject; and administering an appropriate treatment protocol to said subject if said one or more regions of the genomic DNA sequence of the subject contains epigenetic mutations in the same locations as the positive control.
In some embodiments, the environmental agent is selected from the group consisting of vinclozolin, dioxin, permethrin, N,N-diethyl-meta-toluamide (DEET), methoxychlor, dichlorodiphenyltrichloroethane (DDT), bisphenol A (BPA), phthalates, and hydrocarbon jet fuel. In some embodiments, the disease or condition is selected from the group consisting of low sperm production, abnormalities of sexual organs, ovarian cysts, kidney abnormalities, prostate disease, and immune abnormalities.
Many diseases, even those which are passed from parent to offspring, are not caused by genetic mutations. Rather, the causes of these diseases can be traced to epigenetic modifications of the genome. Aspects of the invention provide methods of identifying regions of DNA which are likely to harbor and/or regulate such epigenetic modifications using machine learning analysis.
A machine learning analysis uses a known training set(s) of data to construct a classifier based on known features to classify larger unknown data sets. Generally an issue with machine learning analysis is that a relatively small set of positive traits are used in reference to a much larger set (i.e., volume) of data with negative (non-relevant) traits. This introduces significant bias in the results due to the imbalance between data sets. In addition, often large sets of predicted features are used in machine learning analysis such that only a small number of critical features are relevant. This can also reduce the efficiency and bias the machine learning analysis.
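The bias introduced by class imbalance can be seen in a small worked example. The counts below are hypothetical and are not taken from the datasets described herein; the sketch only shows why overall accuracy is misleading when positives are rare.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the classifier actually recovers."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total if total else 0.0


# Hypothetical imbalance: 10 positive sites among 1000 candidates.
y_true = [1] * 10 + [0] * 990
# A degenerate classifier that labels every site as negative.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)   # high, despite finding nothing
rec = recall(y_true, y_pred)     # zero positives recovered
```

A classifier that never predicts a positive still scores 99% accuracy here while recovering none of the positives, which is the bias an imbalance-aware learner is meant to correct.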
Aspects of the present invention provide two different machine learning techniques to address these issues. Active learning (ACL) is the selection of important features and examples for an Oracle (e.g. a human expert) to classify. The addition of a generalized query to the ACL allows selection of the optimal features in these examples, which the Oracle can then classify. The Oracle then uses the optimal features identified by ACL to perform imbalance learning and, eventually, the prediction. ACL can also be used to select the most important features and provide insights into the critical features identified. Imbalance class learners (ICL) can be used to reduce the data set imbalance bias and allow for a more accurate analysis. These two techniques facilitate the training of the machine learning classifier.
Embodiments of the present invention use a novel two-step (sequential) machine learning analysis involving a combination of an initial active learning step followed by an imbalance class learner (ACL-ICL) protocol (
The epigenetics datasets can be from epigenetic transgenerational inheritance experiments and F3 generation sperm or somatic cells from various exposure lineages, including Dioxin [46], Hydrocarbon Jet Fuel [16], Vinclozolin [16,18,19,46], Plastics [15], and Pesticide [12,15]. In some embodiments, somatic Sertoli cells and Granulosa cell datasets [20,21] are derived from adult vinclozolin lineage F3 generation somatic cells that influence the onset of testis and ovarian disease, respectively. The datasets for the germ cell and somatic cell DMR sites [54] have differential DNA methylation changes between the F3 generation exposure and control lineages rat cells. These epigenetic data come from investigations of the actions of environmental exposures during fetal gonadal development that induce epigenetic change in the germ line and promote the epigenetic transgenerational inheritance of adult-onset diseases [3]. The Dioxin, Jet Fuel, Vinclozolin, Plastics and Pesticide datasets consist of ancestral environmental exposures of these five compounds individually and are associated with the epigenetic transgenerational inheritance of adult onset diseases. In some embodiments, the molecular procedure to identify the DMR is a differential methylated DNA immunoprecipitation (MeDIP) followed by a tiling array analysis (Chip) for a MeDIP-Chip analysis. In some embodiments, an additional validation is done using two sperm DMR data sets and a combination of the DDT [14] and MXC [13] sperm epimutations is used as a positive control (DDT MXC with 76 DMR).
In some embodiments, the methods of the invention are used to identify a genome-wide set of potential epimutations that can be used to facilitate identification of epigenetic diagnostics for ancestral environmental exposures and disease susceptibility. As described in the Example herein, the inputs to the system are datasets with all features. A generalized query based ACL method can be used to find the most important samples and features for the epigenetic datasets. These features are annotated for the epimutation regions, the identified DNA methylation regions (DMRs), as well as sequences upstream and downstream of the DMRs. The most relevant features of each of the datasets are combined and the ACL is trained on these feature sets. Once ACL training is complete, ICL training is used for prediction across the whole genome for each germ cell and somatic cell data set separately. Once the ICL training is complete, a prediction on the whole genome is made. Thus, the approach allows for the identification of potential new DMRs by first constructing a robust classifier (using the active learning and imbalanced class learning approach) which minimizes false positives, and then scanning the genome for locations which are highly likely to be DMRs. Although previous machine learning approaches applied active learning or imbalance class learning independently, the sequential use for a biological data set is novel.
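The final scanning step can be sketched as a sliding-window pass over a sequence. This is a minimal illustration, not the disclosed implementation: the window size, step, cutoff, and the `toy_score` function are all assumptions, and `toy_score` merely stands in for a trained ACL-ICL classifier.

```python
from typing import Callable, Iterator, List, Tuple


def windows(seq: str, size: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) pairs for fixed-size windows along seq."""
    for start in range(0, max(len(seq) - size, 0) + 1, step):
        yield start, seq[start:start + size]


def scan(seq: str, score: Callable[[str], float], size: int = 400,
         step: int = 100, cutoff: float = 0.9) -> List[Tuple[int, int, float]]:
    """Return (start, end, score) for every full window scoring >= cutoff."""
    hits = []
    for start, window in windows(seq, size, step):
        if len(window) < size:
            continue  # sequence shorter than one window
        p = score(window)
        if p >= cutoff:
            hits.append((start, start + size, p))
    return hits


def toy_score(window: str) -> float:
    """Hypothetical stand-in for a trained classifier: 1.0 if no CpG at all."""
    return 1.0 if window.count("CG") == 0 else 0.0


# Hypothetical 800 bp sequence: a CpG-free half followed by a CpG-rich half.
genome = "A" * 400 + "CG" * 200
candidates = scan(genome, toy_score)
```

In practice the scoring function would be the classifier produced by the training steps described above, and the windows would carry the full annotated feature set rather than raw sequence alone.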
The methods disclosed herein, which use active learning and imbalanced class learning in a combined approach, have distinct advantages over traditional machine learning classification. Biological datasets come with a set of inherent problems. Most data that researchers are interested in (e.g. positive cases) are rare (i.e. imbalanced) in contrast to all other characteristics or features. Efficient learning can be performed only when target concepts from both classes (e.g. DMR and non-DMR) are learned well enough to distinguish them, while learning from only the relevant features. Such computational problems can be approached using specific machine learning techniques. The present invention allows for the identification of the most relevant features and addresses the class imbalance problem. The genomic characteristics of the DMRs are used as features for the learners. Active learning intelligently chooses the best instances/features to learn from. In some embodiments, the approach uses Generalized Query Based Active Learning (GQAL), which not only chooses the best features to learn from but also selects the most relevant features for learning. This is accomplished by constructing intelligent queries, removing irrelevant features from each query so that an Oracle can answer it easily. This approach allows the learner to label multiple instances at the same time instead of labeling one instance per query. In addition, instead of using a global feature reduction (where a set of features is removed at the beginning of the training), GQAL uses a subset of features at each iteration by using local feature selection. This makes use of most of the power of the features and maximizes the use of a subset of features for learning.
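The general idea of active learning can be sketched with pool-based uncertainty sampling. This is not GQAL: it omits generalized queries and local feature selection, and it substitutes a nearest-centroid classifier as an assumption. All names are hypothetical. It only shows how a learner can pick the instances it is least sure about and ask an Oracle to label several of them per round.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def centroid(rows: List[Vector]) -> List[float]:
    """Column-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def margin(x: Vector, pos_c: Vector, neg_c: Vector) -> float:
    """A small margin means x lies near the boundary, so the learner is unsure."""
    return abs(sq_dist(x, pos_c) - sq_dist(x, neg_c))


def active_learn(pool: List[Vector], seed: Dict[int, int],
                 oracle: Callable[[Vector], int],
                 rounds: int = 3, batch: int = 2) -> Dict[int, int]:
    """Grow a labeled set; seed must hold at least one label of each class."""
    labeled = dict(seed)  # pool index -> label (1 = DMR, 0 = non-DMR)
    for _ in range(rounds):
        pos_c = centroid([pool[i] for i, y in labeled.items() if y == 1])
        neg_c = centroid([pool[i] for i, y in labeled.items() if y == 0])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Query the Oracle on the most uncertain instances, several at a time.
        unlabeled.sort(key=lambda i: margin(pool[i], pos_c, neg_c))
        for i in unlabeled[:batch]:
            labeled[i] = oracle(pool[i])
    return labeled


# Hypothetical one-feature pool and a rule-based stand-in for the Oracle.
pool = [[0.0], [1.0], [0.4], [0.6], [0.1], [0.9]]
oracle = lambda x: int(x[0] > 0.5)
first_round = active_learn(pool, {0: 0, 1: 1}, oracle, rounds=1)
all_rounds = active_learn(pool, {0: 0, 1: 1}, oracle, rounds=3)
```

With this toy pool, the first round queries the two points nearest the boundary (0.4 and 0.6) before the easier points, which is the behaviour that lets an active learner spend Oracle effort where it is most informative.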
The GQAL approach has been tested on 13 datasets besides epigenetics and compared first with three other classifiers (KNN, SVM and NB) and later with four more (AdaBoost, Decision Trees, RandomForest and Logistics); GQAL was found to be the most efficient for the epigenetic dataset. Aspects of the present invention combine these two approaches into a single sequential computational tool.
Instead of using an under-sampling or an oversampling technique, as done previously, to reduce or increase the size of each of the classes to make them balanced, in some embodiments the approach described herein uses a boosting technique termed AdaBoost or “Adaptive Boosting” [58,59]. Boosting is a method to increase the weights of certain examples while decreasing the weights of other examples for efficient balanced learning. This approach allows the learning algorithm to learn target concepts well from both classes. This addresses the imbalance class problem. For the AdaBoost algorithm, a weak classifier termed Tree Augmented Bayesian Network (TAN) [60] may be chosen as the classifier. This is a restrictive Bayesian learner which performs better than the Naïve Bayes Classifier (NBC) [61]. The TAN boosted imbalanced class learner has been tested on 5 datasets including 2 epigenetic datasets and compared with 2 other imbalanced class learners (Subset Sampling Optimization and EasyEnsemble) and 5 other regular classifiers (SVM, Logistics, Decision Trees, RandomForest and AdaBoost) [31], and the TAN AdaBoost was found to be the most efficient in the epigenetic dataset.
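The re-weighting performed by boosting can be sketched as follows. This is a minimal discrete AdaBoost over one numeric feature. As an assumption, a decision stump replaces the TAN weak classifier named above, and the data are hypothetical. Labels are +1 and -1.

```python
import math
from typing import List, Tuple

Stump = Tuple[float, float, int]  # (alpha, threshold, polarity)


def stump_predict(x: float, thr: float, pol: int) -> int:
    return pol if x >= thr else -pol


def best_stump(xs: List[float], ys: List[int], w: List[float]):
    """Return (weighted error, threshold, polarity) with the lowest error."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if stump_predict(x, thr, pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best


def adaboost(xs: List[float], ys: List[int], rounds: int = 5):
    """Train stumps; return the ensemble and the final example weights."""
    n = len(xs)
    w = [1.0 / n] * n
    model: List[Stump] = []
    for _ in range(rounds):
        err, thr, pol = best_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) on perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, thr, pol))
        # Misclassified examples gain weight; correctly classified ones lose it.
        w = [wi * math.exp(-alpha * y * stump_predict(x, thr, pol))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model, w


def predict(model: List[Stump], x: float) -> int:
    score = sum(a * stump_predict(x, thr, pol) for a, thr, pol in model)
    return 1 if score >= 0 else -1


# Hypothetical data: the example at x = 4 is misclassified by the first stump.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-1, -1, 1, -1, 1]
model, weights = adaboost(xs, ys, rounds=1)
```

After one round, the single misclassified example carries half of the total weight, so the next weak classifier is pushed to get it right. This is the mechanism by which boosting can keep a rare class from being ignored.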
“Epimutation” and “epigenetic modification” as used herein refer to modifications of cellular DNA that affect gene expression without altering the DNA sequence. The epigenetic modifications are both mitotically and meiotically stable, i.e. after the DNA in a cell (or cells) of an organism has been epigenetically modified, the pattern of modification persists throughout the lifetime of the cell and is passed to progeny cells via both mitosis and meiosis. Therefore, within the organism's lifetime, the pattern of DNA modification and consequences thereof, remain consistent in all cells derived from the parental cell that was originally modified. Further, if the epigenetically modified cell undergoes meiosis to generate gametes (e.g. eggs, sperm), the pattern of epigenetic modification is retained in the gametes and thus inherited by offspring. In other words, the patterns of epigenetic DNA modification are transgenerationally transmissible or inheritable, even though the DNA nucleotide sequence per se has not been altered or mutated. Without being bound by theory, it is believed that enzymes known as methyltransferases shepherd or guide the DNA through the various phases of mitosis or meiosis, reproducing epigenetic modification patterns on new DNA strands as the DNA is replicated.
Exemplary epigenetic modifications include but are not limited to DNA methylation, histone modifications, chromatin structure modifications, and non-coding RNA modifications, etc.
“Epigenetic control region” or “ECR” refers to a segment of DNA which is at least about 400 bp in length, and which is characterized by (contains, comprises, harbors, etc.) at least one of the features described herein, such as differential DNA methylation, a low CpG density (e.g. of about 15% or less), DNA sequence motifs (e.g. EDM1, EDM2), etc. Such DNA segments encompass at least one epimutation and/or at least one epimutation regulatory site. ECRs comprise at least about 400 contiguous base pairs, and may contain up to about 1000 bps (e.g. about 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950 or more) base pairs. In some embodiments, the regions are even larger, e.g. about 1000 or more bps. One or more copies of each DNA sequence motif may be present in a region.
Epigenetic modifications may be caused by exposure to any of a variety of factors, examples of which include but are not limited to: chemical compounds e.g. endocrine disruptors such as vinclozolin; chemicals such as those used in the manufacture of plastics e.g. bisphenol A (BPA), bis(2-ethylhexyl)phthalate (DEHP) and dibutyl phthalate (DBP); insect repellents such as N,N-diethyl-meta-toluamide (DEET); pesticides such as dichlorodiphenyltrichloroethane (DDT); pyrethroids such as permethrin; various polychlorinated dibenzodioxins, known as PCDDs or dioxins e.g. 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD); hydrocarbon mixtures such as jet fuel; extreme conditions such as abnormal nutrition, starvation, etc.
In some embodiments, the methods described herein involve obtaining the nucleotide sequence of a selected DNA sequence of interest (e.g. by obtaining a DNA sample from a donor or subject and then sequencing the DNA within the sample; or obtaining a known nucleotide sequence from a database), and then analyzing the nucleotide sequence. Computer executable algorithms and software programs for implementing the same are encompassed by the invention. The software program may contain instructions for causing a computer to carry out the steps of the methods disclosed herein. The computer program will be embedded in a non-transient medium such as a hard drive, DVD, CD, thumb drive, etc.
In some embodiments, the nucleotide sequence of the DNA sequence of interest may be unknown and it may be necessary to carry out a step of sequencing. Those of skill in the art are familiar with techniques that may be used to sequence DNA, including but not limited to: the Maxam-Gilbert chemical degradation method, the Sanger dideoxy chain termination technique, etc. DNA sequencing has been summarized in many review articles, e.g., B. Barrell, The FASEB Journal, 5, 40 (1991); and G. L. Trainor, Anal. Chem. 62, 418 (1990), and references cited therein. The most widely used DNA sequencing chemistry is the enzymatic chain termination method of Sanger, mentioned above, which has been adopted for several different sequencing strategies. The sequencing reactions are either performed in solution with the use of different DNA polymerases, such as the thermophilic Taq DNA polymerase [M. A. Innes, Proc. Natl. Acad. Sci. USA, 85: 9436 (1988)] or specially modified T7 DNA polymerase (“SEQUENASE”) [S. Tabor and C. C. Richardson, Proc. Natl. Acad. Sci. USA, 84,4767 (1987)], or in conjunction with the use of polymer supports. See for example S. Stahl et al., Nucleic Acids Res., 16, 3025 (1988); M. Uhlen, PCT Application WO 89/09282; Cocuzza et al., PCT Application WO 91/11533; and Jones et al., PCT Application WO 92/03575.
In other embodiments, the nucleotide sequences of the DNA sequence(s) of interest have already been determined and are retrieved e.g. from a database. Such databases, many of which are publicly available, are well known to those of skill in the art, e.g. GenBank.
Selection of a DNA sequence of interest may be predicated on and/or influenced by any number of factors. For example, the DNA sequence of interest may be from a particular species under study (e.g. a mammalian species, including but not limited to humans); the DNA sequence of interest may be from a particular chromosome or region of a chromosome that is suspected to be involved in a disease or condition of interest; etc. The DNA sequence of interest may be isolated from a subject or subjects known or suspected to be afflicted with a disease or condition associated with epigenetic mutations; or who have been or are suspected of having been exposed to an agent that causes, or is suspected of causing, epigenetic mutations; or who have inexplicably inherited a disease or disease condition from a parent for which no DNA sequence mutation has been identified, etc. Subjects whose DNA is analyzed may be of any age or gender, and in any stage of development, so long as cells containing a DNA sequence of interest can be obtained from the subject. For example, the subject may be an adult, an adolescent, a child, an infant, an embryo, a laboratory animal, etc. The cells from which the DNA is obtained may be any suitable cell, including but not limited to gametes, cells from swabs such as buccal swabs, cells sloughed into amniotic fluid, etc.
The genomic features described herein may be used in a variety of therapeutic applications. For example, they may be used to identify locations of epigenetic modification, or locations that are susceptible to epigenetic modification, within a gene sequence of interest. The gene sequence of interest may be a chromosome or a region of interest within a chromosome. Once identified, such regions can serve as biomarkers to be used e.g. in disease diagnosis and/or to detect environmental exposures to agents or conditions that cause epimutations and/or to monitor therapeutic responsiveness to a medicament or treatment and/or used as prognostic indicators. For example, once a particular location on a chromosome is determined to be a region with a high incidence of epigenetic modifications associated with a particular disease or syndrome, or with exposure to a particular agent or event (e.g. exposure to dioxin), then subjects with or without symptoms of exposure can be screened using a diagnostic that detects epigenetic modification of the region. The detection of epigenetic modification at the region (i.e. a positive diagnostic result) will suggest or confirm that the subject has, indeed, likely been exposed to dioxin, and treatments suitable for dioxin exposure can be instituted. In contrast, a negative result (no epigenetic modification at the site) suggests that the subject has not been exposed to dioxin (or at least that the exposure did not result in damage), and other reasons for disease symptoms displayed by the subject can be investigated. If it is known that exposure did occur, then prophylactic screening of a DNA sample from a patient can result in early identification of a risk of disease and lead to early therapeutic intervention. In addition, ongoing monitoring of the extent of epigenetic modification of a site can provide valuable information regarding the outcome of the administration of agents (e.g. drugs or other therapies) which are intended to treat or prevent a condition caused by epimutation, i.e. the therapeutic responsiveness of a patient. Those of skill in the art will recognize that such analyses are generally carried out by comparing the results obtained using an unknown or experimental sample with results obtained using suitable negative or positive controls, or both.
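The comparison of a subject's sample against a positive control can be sketched as an interval-overlap check. The region representation, the function names, and the coordinates are hypothetical; a real diagnostic would also need to define how much overlap and what degree of modification count as a match.

```python
from typing import List, Tuple

# (chromosome, start, end) with half-open coordinates; an assumed convention.
Region = Tuple[str, int, int]


def overlaps(a: Region, b: Region) -> bool:
    """True when two regions share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def shared_regions(subject: List[Region], control: List[Region]) -> List[Region]:
    """Control regions that are also epigenetically modified in the subject."""
    return [c for c in control if any(overlaps(s, c) for s in subject)]


# Hypothetical modified regions, for illustration only.
control = [("chr1", 100, 500), ("chr2", 0, 400)]
subject = [("chr1", 450, 900), ("chr3", 0, 400)]
matches = shared_regions(subject, control)
```

Under these assumptions, a non-empty `matches` list would correspond to a positive diagnostic result for the regions listed, and an empty list to a negative result.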
Information concerning the type and extent of epigenetic modification in a subject may be used in a variety of decision making processes undertaken by a subject that is tested. For example, depending on the severity of the symptoms caused by an epigenetic modification that is identified, a subject may decide to forego having children or to terminate a pregnancy in order to prevent transmission of the modification to offspring. Diagnostic tests based on the present invention can be included in prenatal testing.
In other embodiments, the regions identified as described herein may be monitored in order to ascertain whether or not administration or exposure to an agent or environmental stimulus causes epimutations. For example, candidate drugs or other treatments that are found to cause epigenetic modifications, for example, in cell or animal studies, or during clinical trials, might be avoided or used only as a last resort in a clinical setting, or rejected altogether as viable drug candidates.
Subjects whose DNA is analyzed may be suffering from any of a variety of disorders (diseases, conditions, etc.) including but not limited to: various known late or adult onset conditions, such as low sperm production, abnormalities of sexual organs, ovarian cysts, kidney abnormalities, prostate disease, immune abnormalities, behavioral effects, etc. In other embodiments, no symptoms are present but screening using the diagnostics is employed to rule out the presence of “silent” epigenetic mutations which could cause disease symptoms in the future, or which could be inherited and cause deleterious effects in offspring.
The regions that are identified as described herein may also be used to screen and identify therapeutic modalities for the treatment of epigenetic mutations. Those of skill in the art will recognize that such methods of screening are typically carried out in vitro, e.g. using a DNA sequence that is immobilized in a vessel, or that is present in a cell. However, such tests may also be carried out in model laboratory animals, once the regions are identified. In one embodiment, candidate agents which reverse epigenetic modification are screened by analyzing the regions. In another embodiment, candidate agents which prevent epigenetic modifications are screened by analyzing the regions. In this way, the epigenetic biomarkers can be used to facilitate, e.g. drug development and clinical trials patient stratification (i.e. pharmacoepigenomics).
The invention also provides a system for carrying out the methods of the invention. The system comprises, for example, i) a computer; and ii) non-transient storage medium comprising computer executable instructions which are performed by the computer and which cause the computer to carry out the steps of a) receiving at least one genomic DNA sequence as input; b) scanning said at least one genomic DNA sequence using Active Learning analysis; and c) scanning said at least one genomic DNA sequence using Imbalance Class Learner analysis wherein said steps b) and c) are performed sequentially or simultaneously. The system also generally comprises iii) an output device capable of presenting results obtained by the computer during or as a result of (e.g. in) scanning steps. The system may further comprise a server wherein said computer executable instructions which are performed by the computer cause the computer to carry out steps b) and c) on the server.
The non-transient storage medium may be on the hard drive of the computer, or may be located on a portable device such as a disc, CD, DVD, thumb drive, flash drive, lap top, portable computer (e.g. a PC or other type), or other such device. Alternatively, the non-transient storage medium may be at a location such as a remote location or a database that is accessible via the internet, or stored in a cloud, or in or on another computer or computer system that is accessible by the computer of the system. The non-transient storage medium may also include instructions for causing the computer to receive, as input, at least one genomic DNA sequence from a nucleotide sequencing apparatus or from a database. The database may be downloaded from a remote site (e.g. via the internet), and/or may be located (stored) on the computer, or may be located on another computer or computer system that is accessible by the computer of the system, or may even be located on a portable device as described above. In other embodiments, the data is downloaded from a gene sequencing apparatus, and the system may also include such an apparatus. If present, the apparatus is operably electronically linked to the computer in a manner that allows data gathered or measured by the sequencing apparatus (e.g. a nucleotide sequence) to be outputted and transmitted to and received as input by the computer.
The computer or server can carry out the analysis of one genomic sequence at a time, or, in some embodiments, can analyze two or more sequences at the same time, e.g. by aligning them and scanning them simultaneously. Similarly, the output device may output the results of the scanning steps for one or multiple sequences at the same time.
The output device may be of any suitable type, including but not limited to a printer, a display (e.g. a monitor that displays the results as a list, as a graph, or in some other suitable format), or a modem that sends out information (e.g. to another output device, to another computer, or to a storage device such as a DVD, CD, etc.).
Such a system is illustrated schematically in
For active learning each of the datasets used can be described as a collection of examples each containing a number of features X1, X2 . . . Xn and class label Y. Initially the learner is given a small training set R and a set U of unlabeled training instances. From this unlabeled training set, the learner can query the Oracle to label these instances. The Generalized Query Based Active Learning (GQAL) approach is described in the following steps (
1. Initially, at step 201, the learner L is trained on a small set of labeled examples R, there is a set U of unlabeled training instances, and two separate test sets T1 and T2.
2. The classifier learned by learner L is used on the unlabeled training set U in step 202 to find the most uncertain instance [54].
3. GQAL then takes the chosen uncertain instance and finds the most relevant features for that instance and their ranges in step 203.
4. The process then poses the generalized query in step 204 to the Oracle (Expert), which gives a label and a probability estimation which is the Oracle's confidence about the query label.
5. GQAL takes this generalized query and matches it with existing instances in step 205. Such unlabeled instances are labeled and moved from the unlabeled dataset U to the labeled training set R.
6. The process learns from this updated training set R and tests on the set aside test set T1 in step 206.
7. GQAL goes back to step 202 and repeats this until it reaches a predefined accuracy or iterates a certain number of times in step 207.
8. Once learning is complete the final GQAL classifier from learner L is evaluated on the set aside test set T2 in step 208.
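By way of illustration only, the loop in steps 1 to 8 may be sketched in Python as pool-based uncertainty sampling. This is a simplified sketch under stated assumptions and not the GQAL implementation of [30]: a nearest-centroid learner stands in for the base classifier, the Oracle is a hypothetical labeling function, and each query labels a single instance rather than posing a generalized query over feature ranges. All function names, data and parameter values are hypothetical.

```python
import random

def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def train(labeled):
    """Nearest-centroid learner (an assumed stand-in for the base classifier)."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {y: centroid(xs) for y, xs in by_class.items()}

def dist2(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def predict(model, x):
    return min(model, key=lambda y: dist2(model[y], x))

def uncertainty(model, x):
    """Higher when the two nearest class centroids are almost equally close."""
    d = sorted(dist2(c, x) for c in model.values())
    return -(d[1] - d[0])  # assumes at least two classes are present in R

def accuracy(model, test):
    return sum(predict(model, x) == y for x, y in test) / len(test)

def active_learn(R, U, T1, oracle, target=0.95, max_iter=20):
    R, U = list(R), list(U)
    model = train(R)                                       # step 1
    for _ in range(max_iter):                              # step 7 (iteration cap)
        if not U or accuracy(model, T1) >= target:         # step 6/7 (test on T1)
            break
        x = max(U, key=lambda u: uncertainty(model, u))    # step 2
        U.remove(x)                                        # step 5 (move U -> R)
        R.append((x, oracle(x)))                           # step 4 (query Oracle)
        model = train(R)                                   # step 6 (relearn)
    return model, R

# Hypothetical two-feature data; the Oracle labels by a fixed rule.
random.seed(0)
oracle = lambda x: 1 if x[0] + x[1] > 1 else 0
points = [(random.random(), random.random()) for _ in range(200)]
U = points[:100]
T1 = [(x, oracle(x)) for x in points[100:150]]
T2 = [(x, oracle(x)) for x in points[150:]]
R0 = [((0.2, 0.1), 0), ((0.6, 0.9), 1)]

model, R_final = active_learn(R0, U, T1, oracle)
acc_T2 = accuracy(model, T2)                               # step 8
print(len(R_final), acc_T2)
```

Step 3 (finding the most relevant features and their ranges for the chosen instance) is omitted from this sketch; in the described approach the query is generalized so that several matching unlabeled instances may be labeled at once.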
In brief, the GQAL takes a large set of features and known training sets having known epimutations, and individually determines the optimal features associated with the known epimutations. This is done for each feature separately and then those features that contribute to the positive identification of the known epimutations during training are selected for future use in the analysis by the Oracle (e.g. human performing the analysis). This is repeated with different training sets and increased number of known epimutations to develop the algorithm used for the subsequent analysis with ICL.
In some embodiments, the Tree Augmented Naive Bayes (TAN) is used as a base classifier for the GQAL learner. Details of this algorithm are given in the GQAL paper [30]. In some embodiments, after running active learning on the entire feature set, the features which appeared as “don't care” or irrelevant features are removed and features that appeared five times or more are selected as the top features for the dataset. Once the most important features are chosen, they are used for imbalanced class learning, which is the next step in the combined approach.
In some embodiments, the ICL uses a boosting technique called AdaBoost that makes use of the entire dataset. It uses a committee of experts (weighted classifiers) to classify any new instance based on majority voting. For the training, initially all instances in the dataset have equal weights. In each iteration AdaBoost increases the weight on the incorrectly classified instances and decreases the weight on the correctly classified instances. After each iteration the classifier, which minimizes the error, is chosen as a committee expert and used to update all the instances for the next iteration. Similar to GQAL the TAN classifier is used as a base classifier with AdaBoost. The objective with the ICL is to correct for imbalance in the data sets. For example, the majority of sites in the genome are non-epimutation sites and a small number are potential epimutations. The ICL corrects for this as described above with established machine learning tools and weighting of the data sets. This contributes to an algorithm that will facilitate the prediction of the potential epimutation sites and genomic locations.
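The reweighting scheme described above may be illustrated with a minimal discrete AdaBoost sketch in Python. The sketch rests on assumptions that are not part of the described embodiments: single-feature threshold rules (decision stumps) stand in for the TAN base classifier, class labels are coded as +1 (epimutation) and -1 (non-epimutation), and the small imbalanced data set is invented for illustration.

```python
import math

def stump_predict(stump, x):
    """A stump is (feature index, threshold, sign): predict sign if x[j] > thr."""
    j, thr, sign = stump
    return sign if x[j] > thr else -sign

def best_stump(X, y, w):
    """Return the stump with the lowest weighted error on the current weights."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (1, -1):
                stump = (j, thr, sign)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(stump, xi) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err

def adaboost(X, y, rounds=10):
    n = len(X)
    w = [1.0 / n] * n                      # all instances start with equal weight
    committee = []                         # list of (vote weight, stump)
    for _ in range(rounds):
        stump, err = best_stump(X, y, w)   # classifier that minimizes the error
        if err >= 0.5:                     # no better than chance: stop
            break
        err = max(err, 1e-10)              # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        committee.append((alpha, stump))
        # Increase weights of misclassified instances, decrease the others.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return committee

def classify(committee, x):
    """Weighted majority vote of the committee of experts."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in committee)
    return 1 if score > 0 else -1

# Hypothetical imbalanced data: 2 positive sites among 10.
X = [[i / 10] for i in range(1, 11)]
y = [1 if i in (5, 6) else -1 for i in range(1, 11)]
committee = adaboost(X, y, rounds=10)
predictions = [classify(committee, x) for x in X]
print(predictions)
```

Because the minority-class instances are misclassified by the first, majority-favoring expert, their weights grow in later rounds, which is how boosting shifts attention toward the rare class.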
In a combined approach, active learning is first used to select the most important features at each iteration, and the imbalanced class learner is then used as a boosting method to maximize the accuracy while learning from an imbalanced dataset. This combined approach (GQAL+(TAN+AdaBoost)) is a novel technique relative to other tightly integrated approaches.
Before exemplary embodiments of the present invention are described in greater detail, it is to be understood that this invention is not limited to particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.
Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range and any other stated or intervening value in that stated range, is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges and are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present invention, representative illustrative methods and materials are now described.
All publications and patents cited in this specification are herein incorporated by reference as if each individual publication or patent were specifically and individually indicated to be incorporated by reference and are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited. The citation of any publication is for its disclosure prior to the filing date and should not be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.
It is noted that, as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise. It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely,” “only” and the like in connection with the recitation of claim elements, or use of a “negative” limitation.
As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present invention. Any recited method can be carried out in the order of events recited or in any other order which is logically possible.
The invention is further described by the following non-limiting example which further illustrates the invention, and is not intended, nor should it be interpreted to, limit the scope of the invention.
Environmentally induced epigenetic transgenerational inheritance of disease and phenotypic variation involves germline transmitted epimutations. The primary epimutations identified involve altered differential DNA methylation regions (DMRs). Different environmental toxicants have been shown to promote exposure (i.e., toxicant) specific signatures of germline epimutations. Analysis of genomic features associated with these epimutations identified low-density CpG regions (<3 CpG/100 bp) termed CpG deserts and a number of unique DNA sequence motifs. The rat genome was annotated for these and additional relevant features. The objective of the current study was to use a machine learning computational approach to predict all potential epimutations in the genome. A number of previously identified sperm epimutations were used as training sets. A novel machine learning approach using a sequential combination of Active Learning and Imbalance Class Learner analysis was developed. The transgenerational sperm epimutation analysis identified approximately 50K individual sites with a 1 kb mean size and 3,233 regions that had a minimum of three adjacent sites with a mean size of 3.5 kb. A select number of the most relevant genomic features were identified with the low density CpG deserts being a critical genomic feature of the features selected. A similar independent analysis with transgenerational somatic cell epimutation training sets identified a smaller number of 1,503 regions of genome-wide predicted sites and differences in genomic feature contributions. The predicted genome-wide germline (sperm) epimutations were found to be distinct from the predicted somatic cell epimutations. Validation of the genome-wide germline predicted sites used two recently identified transgenerational sperm epimutation signature sets from the pesticides dichlorodiphenyltrichloroethane (DDT) and methoxychlor (MXC) exposure lineage F3 generation. 
Analysis of this positive validation data set showed a 100% prediction accuracy for all the DDT-MXC sperm epimutations. Observations further elucidate the genomic features associated with transgenerational germline epimutations and identify a genome-wide set of potential epimutations that can be used to facilitate identification of epigenetic diagnostics for ancestral environmental exposures and disease susceptibility.
A previous study used known imprinted genes and associated genomic features in both mouse and humans to predict additional imprinted genes [25,26]. This study identified critical genomic features and demonstrated approximately 600 new potential imprinted genes [25]. Although this previous analysis investigated a distinct epigenetic process (i.e., imprinting), a similar rationale was used in the current study. The approach used known transgenerational sperm epimutation data sets from a variety of exposures as a training set for a machine learning analysis. A similar approach was used with transgenerational somatic cell epimutation data sets to determine differences and similarities between the germline and somatic cell epimutations. The genomic features previously identified and additional features were used to identify genome-wide regions susceptible to become transgenerational epimutations.
The objective is to utilize a novel machine learning approach with known transgenerational sperm epimutations and associated genomic features to predict genome-wide regions that have a susceptibility to develop into transgenerational epimutations. Observations provide insights into the genomic features associated with epimutations and help understand why these sites may be transgenerationally programmed. Previous studies [1,18] have suggested exposure specificity in epimutations, as well as disease susceptibility later in life. Therefore, genome-wide transgenerational epimutation data sets for germ cells and somatic cells will be invaluable in future identification of diagnostics for environmental exposures and later life disease susceptibility.
Epigenetic Datasets. The epigenetic datasets are from epigenetic transgenerational inheritance experiments and F3 generation sperm or somatic cells from various exposure lineages, including Dioxin [46], Hydrocarbon Jet Fuel [16], Vinclozolin [16,18,19,46], Plastics [15], and Pesticide [12,15]. The somatic Sertoli cell and granulosa cell datasets [20,21] are derived from adult vinclozolin lineage F3 generation somatic cells that influence the onset of testis and ovarian disease, respectively. The datasets for the germ cell and somatic cell DMR sites [54] have differential DNA methylation changes between the F3 generation exposure and control lineage rat cells. These epigenetic data come from investigations of the actions of environmental exposures during fetal gonadal development that induce epigenetic change in the germ line and promote the epigenetic transgenerational inheritance of adult-onset diseases [3]. The Dioxin, Jet Fuel, Vinclozolin, Plastics and Pesticide datasets consist of ancestral environmental exposures of these five compounds individually and are associated with the epigenetic transgenerational inheritance of adult-onset diseases. The molecular procedure to identify the DMR was a differential methylated DNA immunoprecipitation (MeDIP) followed by a tiling array analysis (Chip) for a MeDIP-Chip analysis, and the details of how each experiment was performed and how the data were collected have been described previously [18,20,21]. An additional validation was done using two recently identified sperm DMR data sets. A combination of the DDT [14] and MXC [13] sperm epimutations is used as a positive control (DDT MXC with 76 DMR).
Active Learning. For active learning each of the datasets used can be described as a collection of examples each containing a number of features X1, X2 . . . Xn and class label Y. Initially the learner is given a small training set R and a set U of unlabeled training instances. From this unlabeled training set, the learner can query the Oracle to label these instances. The GQAL approach is described in the following steps:
1. Initially the learner L is trained on a small set of labeled examples R, there is a set U of unlabeled training instances, and two separate test sets T1 and T2.
2. The classifier learned by learner L is used on the unlabeled training set U to find the most uncertain instance [54].
3. GQAL then takes the chosen uncertain instance and finds the most relevant features for that instance and their ranges.
4. The algorithm poses the generalized query to the Oracle, which gives a label and a probability estimation which is the Oracle's confidence about the query label.
5. GQAL will take this generalized query and match it with existing instances. Such unlabeled instances are labeled and moved from the unlabeled dataset U to the labeled training set R.
6. The algorithm learns from this updated training set R and tests on the set aside test set T1.
7. GQAL goes back to step 2 and repeats this until it reaches a predefined accuracy or iterates a certain number of times.
8. Once learning is complete the final GQAL classifier from learner L is evaluated on the set aside test set T2.
The Tree Augmented Naive Bayes (TAN) is used as a base classifier for the GQAL learner. Details of this algorithm are given in the GQAL paper [30]. After running active learning on the entire feature set of 834 features, the features which appeared as “don't care” or irrelevant features were removed and features that appeared five times or more were selected as the top features for the dataset. This resulted in 149 features for SG and 134 features for DHVPP. The entire list of genomic features is given in Tables 1 and 2. They are grouped into CpG information, repeat elements, transcription factors, sequence motifs and mammalian motifs. Once the most important features were chosen, they were used for imbalanced class learning, which is the next step in the combined approach.
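The count-based selection of top features may be sketched as follows. This is an illustrative sketch only: it assumes that each active learning query yields a list of the features found relevant for that query, and the feature names shown are hypothetical placeholders rather than entries from Tables 1 and 2.

```python
from collections import Counter

def select_top_features(query_features, min_count=5):
    """Keep features that appear as relevant in at least min_count queries."""
    counts = Counter(f for q in query_features for f in set(q))
    return sorted(f for f, c in counts.items() if c >= min_count)

# Hypothetical per-query relevant-feature lists.
queries = [["cpg_density", "repeat_A"]] * 5 + [["cpg_density", "motif_B"]] * 2
top_features = select_top_features(queries)
print(top_features)  # ['cpg_density', 'repeat_A']
```

Features that fall below the threshold (here the hypothetical "motif_B", seen in only two queries) are treated as irrelevant and dropped before imbalanced class learning.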
Imbalanced Class Learner. The ICL uses a boosting technique called AdaBoost that makes use of the entire dataset. It uses a committee of experts (weighted classifiers) to classify any new instance based on majority voting. For the training, initially all instances in the dataset have equal weights. In each iteration AdaBoost increases the weight on the incorrectly classified instances and decreases the weight on the correctly classified instances. After each iteration the classifier which minimizes the error is chosen as a committee expert and used to update all the instances for the next iteration. Similar to GQAL the TAN classifier is used as a base classifier with AdaBoost.
The two-step DMR identification machine learning framework is as shown in
Both the GQAL and TAN+AdaBoost approaches were trained with 10-fold cross validation on the DHVPP and SG data. The models created from these two training sets were separately tested for validity using the MXC-DDT and Sox9SryTcf21 datasets. Validation results show that models trained on either the SG or the DHVPP dataset can correctly identify the DMR dataset MXC-DDT as DMR and can identify the non-DMR, non-epigenetic dataset Sox9SryTcf21 as non-DMR, with some restrictions.
Clustering. After the potential DMR sites (1,503 for SG and 3,233 for DHVPP) were extracted, further analysis of the data was done to determine whether these novel potential DMR sites cluster in certain locations in the genome. A previous study used tissue gene expression array data in a cluster analysis of transgenerational differentially expressed genes to identify gene clusters with statistically significant over-represented gene expression [35]. These locations were termed Epigenetic Control Regions (ECRs). A similar analysis was done for DMR sites to find whether such ECRs exist for the predicted epimutation sites. An overlapping sliding window of 2,000,000 bases was moved at an interval of 50,000 bases to count the number of potential DMR within each window. A Z-test was then performed, and a p-value cut-off of 0.05 for statistical significance, including false discovery analysis, was used to find the windows with over-representation of predicted DMR sites. Consecutive overlapping windows were then merged to form the final list of clusters.
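The sliding-window cluster analysis may be illustrated with the following Python sketch. Several details are assumptions made for illustration: the Z-score of each window is computed against the mean and standard deviation of the counts over all windows on the chromosome, the test is one-sided, and the false discovery analysis is omitted. The chromosome length and site positions are invented.

```python
import math

def window_counts(positions, chrom_len, size=2_000_000, step=50_000):
    """Count sites in each overlapping window [start, start + size)."""
    counts = []
    for start in range(0, max(chrom_len - size, 0) + 1, step):
        end = start + size
        counts.append((start, end, sum(start <= p < end for p in positions)))
    return counts

def significant_windows(counts, alpha=0.05):
    """One-sided Z-test of each window count against all window counts."""
    n = len(counts)
    mean = sum(c for _, _, c in counts) / n
    sd = math.sqrt(sum((c - mean) ** 2 for _, _, c in counts) / n)
    if sd == 0:
        return []
    hits = []
    for start, end, c in counts:
        z = (c - mean) / sd
        p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
        if p < alpha:
            hits.append((start, end))
    return hits

def merge_windows(windows):
    """Merge consecutive overlapping windows into clusters."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical 20 Mb chromosome: evenly spaced background sites plus a
# dense group of 30 predicted sites starting at position 5,000,000.
background = [500_000 + k * 1_000_000 for k in range(20)]
dense = [5_000_000 + i * 1_000 for i in range(30)]
counts = window_counts(background + dense, chrom_len=20_000_000)
clusters = merge_windows(significant_windows(counts))
print(clusters)  # [(3050000, 7000000)]
```

In this toy example every window that contains the dense group is significant, and merging those overlapping windows yields a single cluster spanning them.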
Feature Extraction. The feature extraction included using RepeatMasker, motif discovery tools and consensus sequences obtained from JASPER and other sources [20]. Features were extracted from the base region and from 1 k, 5 k and 100 k upstream and downstream. A non-overlapping window of 1000 bases was used to scan all the chromosomes of the rat to create the testing regions, and features were then collected from these regions and around them (using each 1000-base window as the base region). The same features were used for training and testing for each individual dataset.
The machine learning approach used in this study (
The selected 834 genomic features can be grouped into four sub-groups (Table 4). They are CpG density and related information (3 total features), repeat elements (216 total features), transcription factors (207 total features) and DNA sequence motifs (60 total features). The sequence motif group has a subgroup called mammalian motifs (348 total features) as these features were collected from the online JASPER dataset [32]. All these features were annotated for the epimutation regions (the identified DMR regions), as well as for sequences 1 k, 5 k, and 100 k upstream and downstream of the DMRs. ACL was run on the DHVPP and SG datasets separately and only those features that appeared greater than 5 times, as well as some manually selected important features were chosen as the most relevant features for further analysis (Tables 5 and 6). This information for each of these datasets was combined and ACL trained on these feature sets. Once ACL training was complete ICL training was used for prediction across the whole genome for each germ cell and somatic cell data set separately (
Since most of the DMR locations are found within 600 bp to 1500 bp windows, a non-overlapping sliding window of 1000 bp was used on each chromosome to identify potential DMR candidate sites. The original 834 selected genomic features were extracted/annotated for the entire rat genome DNA sequence. The number of initial extracted/annotated feature sets is shown in Table 4. For each of the 21 rat chromosomes (autosomes and X chromosome) a sliding non-overlapping window size of 1000 bases was used to create a total of 2,630,424 sites. In the same manner as the training dataset, FASTA files were created. RepeatMasker was run and finally a list of 834 features was extracted from each of these sites. This is the test set used for prediction. Once the training was complete, a prediction on the whole genome was made. This approach to find potential new DMRs is the first to construct a robust classifier (using both imbalanced class and active learning approach) which minimizes false positives, and then scan the genome for locations which are highly likely to be DMRs,
Once these features were identified, annotated and extracted from the training datasets, active learning was used to find the most relevant features. The features which appeared 5 or fewer times were considered “don't care” attributes (irrelevant features) and a set of manually selected features was taken as the list of most relevant features. The most relevant features for the two training datasets are presented in Tables 1, 5, and 6. The list of features includes the following categories: (a) CpG information (b) repeat elements (c) transcription factors (d) sequence motifs and (e) mammalian motifs. The CpG Information contains three features: length of the sites in base pairs, number of CpG sites, and CpG density (number of CpG sites per 100 bases). The transgenerational epimutations have been found in low CpG density regions (termed CpG deserts) [22]. The genomic feature of low CpG density was found to be one of the most important features for both the somatic and germ cell prediction datasets. The repeat elements original list contained a total of 216 repeat features. Both the somatic and sperm datasets had 32 repeat elements (with significant overlaps) in their final lists of 149 somatic (SG) and 134 sperm (DHVPP) features (Tables 4-6). The original transcription factor group contained 207 features. In the final list for sperm (DHVPP) there were 32 transcription factor features and for the somatic cells (SG) there were 41 features. The DNA sequence motifs [33,34] had 60 original features selected for this study. For the sperm (DHVPP) dataset there are 11 critical sequence motif features and for the somatic (SG) dataset there are 4 critical sequence motif features. Mammalian motifs originally considered involved 348 features from the JASPER dataset [32]. For the sperm (DHVPP) there were 58 mammalian motif features while for the somatic cell (SG) there were 71 of them (Tables 5 and 6).
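The three CpG information features named above (site length, number of CpG sites, and CpG density per 100 bases) may be computed per 1000-base site as in the following sketch. The function names and the toy sequence are hypothetical; the sketch covers only the CpG group and none of the repeat element or motif features.

```python
def tile(seq_len, size=1000):
    """Non-overlapping windows (start, end) covering a sequence of seq_len bases."""
    return [(s, min(s + size, seq_len)) for s in range(0, seq_len, size)]

def cpg_features(seq):
    """Length in bases, number of CpG dinucleotides, and CpG per 100 bases."""
    seq = seq.upper()
    length = len(seq)
    n_cpg = seq.count("CG")
    density = 100.0 * n_cpg / length if length else 0.0
    return {"length": length, "cpg_count": n_cpg, "cpg_density": density}

# Toy 2000-base sequence with one CpG in every four bases.
sequence = "ACGT" * 500
sites = [cpg_features(sequence[start:end]) for start, end in tile(len(sequence))]
print(sites[0])  # {'length': 1000, 'cpg_count': 250, 'cpg_density': 25.0}
```

A window whose density falls below the chosen low-density cut-off would be annotated as lying in a CpG desert; the toy sequence here is deliberately CpG-rich so that the arithmetic is easy to check by hand.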
Once the final list of features was selected for the two datasets they were used for training in the ICL, and used for the genome wide prediction. The sperm and somatic cell analysis was done separately with the relevant list for each. The initial number of predicted epimutation sites identified was 48,557 sites for the sperm (DHVPP) and 28,564 sites for the somatic cells (SG). However, after an initial number of individual sites were found, only those with three or more consecutive sites were merged to create the most stringent list of potential susceptible DMR sites. The reason for focusing on three or more consecutive sites is that single predicted sites have a lower statistical significance and a higher potential for false positives. Although the single sites are viable potential DMR to consider, a more stringent analysis of DMR was used of three or more consecutive probes being present to further investigate the potential differential DNA methylation regions. These three or more consecutive sites were merged to create the list of potential susceptible DMR sites. The final list of potential DMR for the sperm DHVPP analysis was 3,233 sites and for the somatic cell SG analysis was 1,503 sites.
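The merging of three or more consecutive predicted sites into a single region may be sketched as follows. The sketch assumes that each predicted site is represented by its chromosome name and the start coordinate of its 1000-base window; the coordinates shown are invented for illustration.

```python
def merge_consecutive(sites, size=1000, min_run=3):
    """Merge runs of at least min_run adjacent windows into (chrom, start, end)."""
    regions, run = [], []

    def flush():
        if len(run) >= min_run:
            regions.append((run[0][0], run[0][1], run[-1][1] + size))

    for chrom, start in sorted(set(sites)):
        if run and chrom == run[-1][0] and start == run[-1][1] + size:
            run.append((chrom, start))      # extends the current run
        else:
            flush()                         # close the previous run
            run = [(chrom, start)]
    flush()
    return regions

# Hypothetical predicted sites: one run of 3, one of 2, one of 4.
predicted = [("chr1", 0), ("chr1", 1000), ("chr1", 2000),
             ("chr1", 5000), ("chr1", 6000),
             ("chr2", 3000), ("chr2", 4000), ("chr2", 5000), ("chr2", 6000)]
regions = merge_consecutive(predicted)
print(regions)  # [('chr1', 0, 3000), ('chr2', 3000, 7000)]
```

The run of two adjacent sites on the hypothetical chr1 is discarded, reflecting the more stringent requirement of three or more consecutive sites.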
The chromosome plots for the datasets DHVPP (
The following analyses investigated the genomic features of the predicted DMR/epimutations. The initial analysis was to check the CpG density of the regions which were identified as potential DMRs. The predicted DMR CpG density (number of CpG in each 100 bases) distribution was determined and shown in
Transcription factor binding sequence motifs and mammalian sequence motifs were the next features investigated. These features were collected from the DMR region and upstream and downstream of the DMR. Features were extracted from 1 k, 5 k and 100 k upstream and downstream regions of the DMR region. The consensus sequence correlations to the prediction of DMRs are shown in
The repeat elements were chosen as a group of features (based on their location and distance from the DMR region) and for the predicted DMR that had the feature, prediction power was calculated to see which repeat elements gave the highest accuracy. All the repeat elements were grouped into 1 k, 5 k, 100 k upstream and downstream. The predictive power of repeat elements for DHVPP and SG is shown in
A comparison was made between the genome-wide predicted DMR/epimutation in the germ cell data sets and somatic cell data sets. The distribution of the predicted DMR on the various chromosomes is shown in Tables 9 and 10. Overlap between the potential predicted DMR sets derived from the germline DHVPP and somatic SG datasets showed only five common predicted sites (
In order to help validate the machine learning results for the predicted germ cell DMR data set a positive validation analysis was performed. For the positive validation analysis the predicted DMR datasets were compared to two more recently developed sperm DMR datasets which were not used as test sets in the machine learning analysis. The first was a DDT transgenerational sperm DMR set [14] and second a methoxychlor (MXC) data set [13]. The two DMR positive control data sets were combined and termed the sperm MXC-DDT DMR data set. The description of the datasets is given in Table 3. The germ cell learned classifier accurately predicted all the DMRs in the sperm MXC-DDT dataset, 100% prediction accuracy (Table 9). Prediction accuracy is defined as the number of previously identified DMR that were identified by the computational tool. In addition, a comparison of the MXC-DDT DMR with the predicted genome-wide sperm DMR showed 38% overlap with the single site comparison (
Previous studies have demonstrated a variety of environmental factors from abnormal nutrition [39-45] to toxicant exposures can promote the epigenetic transgenerational inheritance of disease susceptibility and germline (e.g., sperm) epimutations [1]. Examples include the agricultural fungicide vinclozolin [11,17], the industrial contaminant dioxin [46,47], a hydrocarbon mixture jet fuel (JP8) [16], the plastic derived compounds bisphenol A (BPA) and phthalates [15,48,49], the pesticides methoxychlor [11,13] and dichlorodiphenyltrichloroethane (DDT) [14], and permethrin and N,N-Diethyl-meta-toluamide (DEET) [12]. All these environmental exposures of a gestating female (F0 generation) during the period of fetal gonadal sex determination promoted the epigenetic transgenerational (i.e. F3 generation) inheritance of disease. The transgenerational disease observed varied between the exposures, but generally involved abnormalities in the testis (spermatogenic cell apoptosis), ovary (polycystic ovarian disease), kidney (cyst development), prostate (epithelial cell atrophy), and behavioral abnormalities including mate preference changes and anxiety [1]. Interestingly, the chromosomal locations of the transgenerational sperm epimutations were generally distinct between the different exposure lineages [18]. Therefore, the sperm were found to have an exposure specific set of epimutations [1] and the epimutations all had common genomic features of a low CpG (<10 CpG/100 bp) density (i.e., CpG deserts) [22] and unique DNA sequence motifs [23].
The current study was designed to use these various transgenerational epimutation datasets as training sets in a novel sequential machine learning approach to identify the potential genome-wide locations of transgenerational epimutations. Although previous machine learning approaches applied active learning or imbalance class learning independently, the sequential use for a biological data set is novel. The training datasets from the epigenetic transgenerational (F3 generation) inheritance of sperm epimutations from various exposure lineages included; dioxin [46], jet fuel [16], vinclozolin [16,18,19,46], plastics (BPA phthalates) [15] and pesticide (permethrin and DEET) [12,15]. These exposure specific sperm epimutation datasets were used to develop the machine learning algorithm to predict the genome-wide locations of sperm epimutations. In addition, transgenerational somatic cell epimutation datasets were used to predict genome-wide locations of potential somatic epimutations. The testicular Sertoli cell and ovarian granulosa cells were purified from adult vinclozolin lineage F3 generation tissues and these cell specific epimutations identified [20,21]. These transgenerational somatic cells epimutation datasets were then used independently as training sets in the machine learning approach to develop the algorithm for transgenerational somatic cell epimutations and compare to that of transgenerational germline epimutation predictions.
In a previous research study that looked into finding potential imprinted genes in human and mouse genomes, Jirtle and colleagues mined the mouse genome and found thousands of relevant features for machine learning prediction of potential imprinted genes [25]. Imprinted genes are parent of origin monoallelic expressed genes with critical developmental functions [50]. Mining the DNA sequence characteristics up to 100 kb upstream and downstream around known imprinted genes developed genomic features and training sets to develop a prediction algorithm [25]. They used the Equbits Foresight (www.equbits.com) classifier and predicted 722 new potential imprinted gene sites. Their study examined 23,788 annotated autosomal mouse genes and identified 600 potential mouse imprinted genes [25]. The same group later mined the human genome for new imprinted sites [26]. They again used the Equbits Foresight which uses the Support Vector Machine (SVM) classifier and 622 features and used their own SMLR (sparse multinomial logistic regression) [51] classifier with 820 features to predict novel human imprinted genes [26]. A second study by another group looked into the correlation of different genomic features in DNA methylation of CpG islands [52]. They mined features from 190 CpG islands from human chromosome 21 and tested it on the rest of the CpG islands in the genome for finding potential methylated CpG islands. A correlation among different features identified potential different methylation profiles for different tissue types and for different diseases [52]. The main difference of the proposed approach with the imprinted gene research is that active learning is used to identify a sub-group of features for each queried training example instead of using a global feature reduction [25,26]. 
For the second study, the main difference is that their approach examines DNA methylation in CpG islands, while the current study examines genome-wide methylation patterns, including low density CpG regions, rather than only the dense CpG regions found in CpG islands [52].
Active learning using the GQAL approach on the transgenerational sperm DHVPP epimutation dataset was performed with 10-fold cross-validation. During training, GQAL found 36% of the features to be redundant and used 245 samples averaged over all iterations. Once training was complete, the learned classifier was tested on an independent test set and achieved an accuracy of 99.2%. In contrast, for the somatic cell (SC) dataset, GQAL removed 14% of the features as redundant and used 290 samples averaged over all iterations. Again, after completion of training, the learned classifier was tested on an independent test set and achieved an accuracy of 97.7%. These results illustrate the power of the GQAL approach [30]. While active learning removed redundant features, boosting performed balanced learning on the epigenetic datasets.
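By way of illustration only, the general idea of query-based active learning, in which the learner requests labels only for the samples it is least certain about, can be sketched in Python as follows. The sketch uses margin-based uncertainty sampling with a nearest-centroid classifier on toy two-dimensional data. It is not the GQAL algorithm [30], and the function names (`fit`, `predict`, `margin`, `active_learn`), the toy data and the query budget are assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(labeled):
    """Nearest-centroid model: one mean feature vector per class label."""
    groups = {}
    for x, y in labeled:
        groups.setdefault(y, []).append(x)
    return {y: tuple(sum(col) / len(xs) for col in zip(*xs))
            for y, xs in groups.items()}

def predict(model, x):
    """Return the label of the closest class centroid."""
    return min(model, key=lambda y: sq_dist(model[y], x))

def margin(model, x):
    """Gap between the two closest centroids; a small gap means low certainty."""
    d = sorted(sq_dist(c, x) for c in model.values())
    return d[1] - d[0]

def active_learn(pool, oracle, seeds, budget):
    """Repeatedly query the label of the least certain unlabeled sample.

    seeds must contain at least one labeled sample from each class.
    """
    labeled = list(seeds)
    unlabeled = list(pool)
    for _ in range(min(budget, len(unlabeled))):
        model = fit(labeled)
        query = min(unlabeled, key=lambda x: margin(model, x))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))  # oracle supplies the true label
    return fit(labeled), labeled

# Hypothetical toy data: two separated 5 x 5 grids of points, one per class.
low = [(i * 0.5, j * 0.5) for i in range(5) for j in range(5)]            # class 0
high = [(4 + i * 0.5, 4 + j * 0.5) for i in range(5) for j in range(5)]  # class 1
truth = {p: 0 for p in low}
truth.update({p: 1 for p in high})

seeds = [((0.0, 0.0), 0), ((6.0, 6.0), 1)]
pool = [p for p in low + high if p not in ((0.0, 0.0), (6.0, 6.0))]

model, labeled = active_learn(pool, truth.__getitem__, seeds, budget=8)
accuracy = sum(predict(model, p) == truth[p] for p in truth) / len(truth)
```

In this toy run the learner requests labels for only 8 of the 48 pooled samples, in addition to the 2 seed samples, which mirrors the way an active learner can train on a subset of the available samples.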
Additional analysis was done to determine the predictive power of specific groups of genomic features and of individual genomic features. The percentage of predicted DMR that contained a feature was used as the measure of “prediction power”. For the final prediction, the combined groups of features had the highest impact, with 100% accuracy, compared to individual features. As observed for individual features,
Once the two-step training was completed, the trained model was used for a genome-wide prediction. The rat genome was annotated with all the selected genomic features and the learned classifier was applied. From the initial lists of 48K predicted sites for the sperm DHVPP and 28K predicted sites for the somatic SG, selecting only runs of three or more consecutive sites left a final list of 3,233 sites for the DHVPP germline cells and 1,502 sites for the somatic SG cells. There are more sites in the DHVPP list in part because it is a combination of five different experiments. In contrast, the somatic cell SG datasets involved only two individual cell types, from the testis and ovaries, and contained fewer epimutations than the germ cell datasets.
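The consecutive-site filter described above can be sketched as follows. The sketch assumes that each predicted site is a fixed-size genomic window identified by a chromosome name and an integer window index, so that "consecutive" means adjacent indices on the same chromosome. This window representation, the function name `consecutive_runs` and the example hits are assumptions made for illustration.

```python
def consecutive_runs(predicted, min_run=3):
    """Keep only runs of at least min_run adjacent predicted windows.

    predicted: iterable of (chromosome, window_index) pairs.
    Returns a list of (chromosome, first_index, last_index) tuples.
    """
    by_chrom = {}
    for chrom, idx in set(predicted):          # drop duplicate hits
        by_chrom.setdefault(chrom, []).append(idx)

    runs = []
    for chrom in sorted(by_chrom):
        idxs = sorted(by_chrom[chrom])
        start = prev = idxs[0]
        for idx in idxs[1:]:
            if idx != prev + 1:                # gap found, so the current run ends
                if prev - start + 1 >= min_run:
                    runs.append((chrom, start, prev))
                start = idx
            prev = idx
        if prev - start + 1 >= min_run:        # close the final run
            runs.append((chrom, start, prev))
    return runs

# Hypothetical classifier hits: one run of 3, one run of 2 and one run of 4.
hits = [("chr1", 10), ("chr1", 11), ("chr1", 12),
        ("chr1", 20), ("chr1", 21),
        ("chr2", 5), ("chr2", 6), ("chr2", 7), ("chr2", 8)]
regions = consecutive_runs(hits)
# regions == [("chr1", 10, 12), ("chr2", 5, 8)]
```

Isolated hits and pairs are discarded, which is one simple way to reduce a large list of single predicted sites to a shorter list of candidate regions.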
The number of specific DMR that localized to each chromosome for the 1,502 somatic cell sites and the 3,233 germ cell sites was found to be comparable between chromosomes (Table 9). Chromosomes 1 and 2 show higher numbers of sites for both datasets, in part due to the size of these chromosomes. A cluster analysis for genomic regions with a statistically significant over-representation of predicted DMR identified a number of clusters on each chromosome (
Interestingly, the predicted germ cell DMR and somatic DMR were distinct with negligible overlap (
A partial validation of the novel machine learning approach and predicted genome-wide germ cell DMR used recently identified sperm DMR not used as training data sets. The transgenerational sperm epimutations from DDT [14] and methoxychlor [13] lineage F3 generation animals were combined and used as a positive validation DMR data set termed MXC-DDT. Since these are independently identified transgenerational sperm DMR, they should appear in the transgenerational machine learning predicted genome-wide sperm DMR data set. The analysis showed 100% prediction accuracy of the MXC-DDT DMR being selected by the machine learning algorithm when used as a training set. The MXC-DDT DMR were found to have a 38% overlap with the single sites in comparison with the predicted sperm DMR dataset (
The novel machine learning approach applied generalized query-based active learning and imbalance class learning sequentially to epigenetic datasets. Some studies have applied machine learning to epigenetics [25,26]. However, the machine learning approach developed here can be used to increase the accuracy and efficiency of machine learning prediction with any biological dataset, or with any other type of dataset. The advantage of this novel sequential machine learning approach is better accuracy through balancing the datasets and then using optimal features to train the classifier and increase efficiency. The current approach used a tandem sequential process, but the active and imbalance learning can be combined into a single process. Broader use of this approach is anticipated to improve the specific machine learning tool developed and enhance machine learning applications.
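As a non-limiting illustration of what balancing a dataset can mean in practice, the following Python sketch draws several class-balanced training subsets by randomly undersampling the larger class. Random undersampling is used here only because it is simple to show. It is not the boosting-based imbalance class learner of the present approach, and the function name `balanced_subsets`, its parameters and the toy labels are assumptions made for the example.

```python
import random

def balanced_subsets(samples, labels, n_subsets=5, seed=0):
    """Build n_subsets training sets with equal numbers of samples per class.

    Each subset keeps as many samples per class as the rarest class has,
    drawn at random without replacement from each class.
    """
    rng = random.Random(seed)                  # fixed seed for repeatable draws
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())

    subsets = []
    for _ in range(n_subsets):
        subset = []
        for y, xs in by_class.items():
            subset.extend((x, y) for x in rng.sample(xs, n_min))
        rng.shuffle(subset)                    # avoid class-ordered training data
        subsets.append(subset)
    return subsets

# Hypothetical imbalanced data: 10 positive sites (label 1) among 100 sites.
samples = list(range(100))
labels = [1 if i < 10 else 0 for i in samples]
subsets = balanced_subsets(samples, labels, n_subsets=3)
class_counts = [sum(y for _, y in s) for s in subsets]   # positives per subset
```

A separate classifier can then be trained on each balanced subset and the predictions combined, so that the rare class is not overwhelmed by the common class during training.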
A variety of different environmental exposures [1] have been shown to induce the epigenetic inheritance of disease and phenotypic variation in species ranging from plants, flies, worms, fish, rodents and pigs to humans [1,11,43,63-67]. The germline transmission of altered epigenetic information is the mechanism behind this non-genetic form of inheritance [9]. Differential DNA methylated regions (DMRs) are in part the epigenetic mechanism of epigenetic inheritance [1]. Previous studies have demonstrated that the identified DMRs, termed epimutations, are exposure specific [18] and correlate with later life disease susceptibility [1]. A variety of different disease conditions, behavioral alterations and phenotypic variations are associated with the epigenetic transgenerational inheritance phenomenon [1]. The DMR or epimutations associated with ancestral or early life exposures correlate with later life disease [18]. A number of studies have demonstrated the feasibility of using these epigenetic biomarkers as early stage diagnostics for disease susceptibility [1]. The current study used a novel sequential machine learning approach to predict the potential susceptible DMR and epimutation sites in the genome. This information and these datasets can now be used to more effectively identify the patterns or signatures of DMR associated with specific exposures and disease conditions.
In addition to the prediction of the genome-wide DMR and potential epimutations, the novel machine learning tool also provides critical information regarding the essential genomic molecular features of the DMR. The most important was the low density CpG regions or CpG deserts (
While the invention has been described in terms of its preferred embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims. Accordingly, the present invention should not be limited to the embodiments as described above, but should further include all modifications and equivalents thereof within the spirit and scope of the description provided herein.
Number | Date | Country |
---|---|---|
62252600 | Nov 2015 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 15343516 | Nov 2016 | US |
Child | 16888922 | | US |