The present disclosure relates to methods and systems for predicting and detecting congenital heart defects in patients using molecular markers.
Epigenetic changes including DNA methylation are known to be involved in cardiac embryogenesis (O'Meara & Lee 2015) and in the development of congenital heart defects (Bahado-Singh et al. 2005; Bahado-Singh et al. 2016; Radhakrishna et al. 2016; Radhakrishna et al. 2019). With very few exceptions, cardiac tissue is not available for research analysis in living subjects. This is particularly the case for the developing fetus and this reality has retarded progress on the evaluation of epigenetic changes that are causal or linked to the development of congenital heart defect (CHD). As a consequence, there is significant interest in developing molecular markers in tissues such as blood leukocytes that also reflect epigenetic changes and are biologically linked or correlated with those in sequestered organs such as the developing heart. Many studies (Bahado-Singh et al. 2005; Bahado-Singh et al. 2016; Radhakrishna et al. 2016; Radhakrishna et al. 2019) have shown that the above objective might be achievable. Methylation of newborn leukocyte DNA obtained by heel stick has been used in these studies to elucidate the mechanisms of multiple different types of the common non-syndromic CHD. Also, using conventional statistical analysis and also Artificial Intelligence (AI) approaches, leukocyte DNA methylation accurately predicted different types of non-syndromic congenital heart defects (CHDs) (Bahado-Singh et al. 2019b).
Extensive publications, both clinical and laboratory (Burton & Jauniaux 2018; Maslen 2018) indicate that the placenta and its vascularity play a critical role in cardiac embryogenesis and CHD development. Indeed disorders such as maternal hypertension which can affect placental morphology and vascularity have been shown on meta-analysis to significantly increase the risk of CHD in the offspring of such pregnancies (Ramakrishnan et al. 2015; van Gelder et al. 2015). The placenta is an organ that is available in abundance for postnatal analysis.
In DNA methylation a single carbon atom or so-called ‘methyl group’ is transferred to and covalently bound to position #5 of the cytosine nucleotide ring of the cytosine-guanine (‘C-G’ or ‘CpG’) dinucleotide. This process converts the cytosine base to 5-methylcytosine (5mC). Classically, DNA methylation, particularly when it occurs in the promoter region of the gene which has a high number of CpG repeats, results in suppression of gene transcription or gene silencing. The 5mC can undergo further chemical modification to 5-hydroxymethylcytosine (5hmC). This hydroxymethylation was discovered to occur through the actions of a group of enzymes called ten-eleven translocation (TET) proteins. These are a group of three dioxygenases that catalyze the conversion of 5mC to 5hmC (Tahiliani et al. 2009).
Since 5hmC results from the conversion of 5mC, this conversion is thought to be a mechanism for eliminating 5mC. The accumulation of 5hmC is linked to the regulation of gene transcription. Unlike 5mC, 5hmC binds much less avidly to gene repressor proteins such as the methyl-CpG-binding proteins MBD1, MBD2, and MBD4 (Jin S G, Kadam S, Pfeifer G P. Examination of the specificity of DNA methylation profiling techniques towards 5-methylcytosine and 5-hydroxymethylcytosine. Nucleic Acids Res 2010; 38:e125). Thus, beyond the measurement of 5mC concentration, simultaneous information on 5hmC would provide more detailed epigenetic information and more precise information on gene transcription. 5hmC levels were found to be enhanced in the gene bodies of genes that are transcriptionally active (Nester et al. 2012), thus having an opposite effect on gene expression compared to 5mC.
Secondly, the concentration of 5hmC appears to be an important mechanism of tissue differentiation and a marker of tissue type (Nester et al. 2012). A major limitation of studies of cell-free fetal (cfF) DNA is the heavy contamination of the ‘fetal’ DNA (more precisely, DNA from the placenta, which is a fetal organ) with maternal DNA spilling from lysed leukocytes. As a consequence, cfF DNA in the maternal circulation accounts for only approximately 10-20% of cell-free (cf) DNA in pregnancy (Rafi et al. 2017). 5hmC levels are low in maternal leukocytes, so measuring cf fetal DNA 5hmC levels should improve the specificity of identifying fetal and/or placental cf DNA in maternal circulation.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify all key features or essential features of the claimed subject matter, nor is it intended to be used alone as an aid in determining the scope of the claimed subject matter.
Foundational to Precision Medicine is the integration of Artificial Intelligence (AI), big data, omics technology, and the development of blood tests, to elucidate the pathogenesis of and accurately detect complex disorders (Mesko B, 2017). Epigenomic analysis of cell-free fetal DNA (cfF DNA) in combination with AI techniques was used to interrogate the pathogenesis of and to detect fetal non-syndromic CHD. In this prospective study, cfF DNA was extracted from maternal blood and whole-genome DNA methylation analysis was performed using the Illumina Infinium MethylationEPIC BeadChip arrays. Six different AI platforms including Deep Learning (DL), the most recent AI approach, were used for CHD detection. Ingenuity Pathway Analysis based on gene loci that were significantly differentially methylated, was used to investigate the molecular basis of CHD.
There were a total of 12 cases of isolated CHD and 26 controls. AI accurately detected CHD. Using RF, the combination of five CpG markers had an AUC (95% CI)=0.98 (0.83−1) with 97% sensitivity and 93% specificity, followed by the SVM model with AUC (95% CI)=0.97 (0.87−1) with 98% sensitivity and 94% specificity for CHD detection. Logistic regression with cross-validation using CpG markers and a prior history of a CHD fetus had an AUC (95% CI)=0.98 (0.98−0.97) with 100% sensitivity and 85% specificity for CHD. Epigenetic dysregulation of genes and gene pathways involved in cardiogenesis and cardiac anomaly development was found in non-syndromic CHD. This provides biological confirmation or plausibility that the findings were not random occurrences.
The method described herein accurately detects fetal CHD using AI analysis of cfF DNA in maternal blood. Applicant's data provide further evidence of the role of epigenetic dysfunction in non-syndromic CHD development. Applicant's findings represent further progress in the development of fetal cardiovascular precision medicine.
Birth defects, i.e. abnormalities developing in fetal life and present at birth, are the major cause of infant death, defined as death within a year of birth, in the USA. CHDs occur with a frequency of 8-9 cases per 1,000 live births. CHD is the most common group of severe birth defects and is the costliest in terms of hospitalization. Up to 25% of cases with major CHD in newborns are not diagnosed prior to discharge from the hospital.
Heart development in embryonic and fetal life requires the coordination and orchestration of a large number of different genes. A relatively small percentage of CHD cases is known to be related to gene mutations, which are changes in the normal sequence in which the basic building blocks (“nucleotides”) are arranged in the DNA of the gene. Such mutations lead to malfunctioning or non-functioning of genes (i.e. altered amounts of, or the production of abnormal types of, proteins) that are important for normal heart development.
In the last six decades, an important mechanism for controlling gene function called “epigenetics” has been discovered and extensively investigated. The term “epigenetics” can be used to describe the interaction between genes and the environment. These interactions do not result in changes to the genome sequence itself (no nucleotide sequence changes) but change gene expression, which still accounts for variations in phenotypic expression. Epigenetics is defined as heritable (i.e. passed onto offspring) changes in gene expression of cells that are not primarily due to mutations or changes in the sequence of nucleotides (adenine, thymine, guanine, and cytosine) in the genes. Epigenetics is a reversible regulation of gene expression by several potential mechanisms. One such mechanism, which is the most extensively studied, is DNA methylation. Other mechanisms include changes in the 3-dimensional structure of the DNA, histone protein modification, and micro-RNA inhibitory activity. The epigenetic mechanisms are known to be extensively inter-related.
Cytosine refers to one of a group of four building blocks (“nucleotides”) from which DNA is constructed. The chemical structure of cytosine is in the form of a pyrimidine ring. Apart from cytosine, the other nucleotides or building blocks found in DNA are thymine, adenine, and guanine.
The term methylation refers to the enzymatic addition of a “methyl group” or single carbon atom to position #5 of the pyrimidine ring of cytosine, which leads to the conversion of cytosine to 5-methyl-cytosine. The methylation of cytosine as described is accomplished by the actions of a family of enzymes named DNA methyltransferases (DNMTs). The 5-methyl-cytosine, when formed, is prone to mutation, that is, the chemical transformation of the original cytosine to form thymine. Five-methyl-cytosines account for about 1% of the nucleotide bases overall in the normal genome.
The term hypermethylation refers to increased frequency or percentage methylation at a particular cytosine locus when specimens from an individual or group of interest are compared to a normal or control group.
Cytosine is usually paired with guanosine, another nucleotide, in a linear sequence along the single DNA strand to form CpG pairs. “CpG” refers to a cytosine-phosphate-guanosine chemical bond in which phosphate binds the two nucleotides together. In mammals, in approximately 70-80% of these CpG pairs the cytosine is methylated (Chatterjee R, Vinson C. Biochimica et Biophysica Acta 2012; 1819:763-70). The term “CpG island” refers to regions in the genome with a high concentration of CG dinucleotide pairs or CpG sites. “CpG islands” are often found close to genes in mammalian DNA. The length of DNA occupied by the CpG island is usually 300-3000 base pairs. The CG cluster is on the same single strand of DNA. The CpG island is defined by various criteria including i) the length of recurrent CG dinucleotide pairs occupying at least 200 bp of DNA and ii) a CG content of the segment of at least 50% along with the fact that the observed/expected CpG ratio should be greater than 60%. In humans, about 70% of the promoter regions of genes have high CG content. The CG dinucleotide pairs may exist elsewhere in the gene or outside of a gene and may not be known to be associated with a particular gene.
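The CpG island criteria listed above (length of at least 200 bp, CG content of at least 50%, observed/expected CpG ratio greater than 60%) can be illustrated with a minimal sketch. The observed/expected formula used here, (number of CpG × length) / (number of C × number of G), is a commonly used definition and is an assumption; the text above does not state which formula was applied. The function names are hypothetical.

```python
def cpg_island_stats(seq: str) -> dict:
    """Compute length, C+G content, and observed/expected CpG ratio for a DNA segment."""
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # CG dinucleotides on this strand
    gc_content = (c + g) / n if n else 0.0
    # Assumed formula: observed CpG count divided by the count expected
    # from the C and G frequencies of the segment.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return {"length": n, "gc_content": gc_content, "obs_exp": obs_exp}


def meets_island_criteria(seq: str, min_len: int = 200,
                          min_gc: float = 0.50,
                          min_obs_exp: float = 0.60) -> bool:
    """Apply the three thresholds quoted in the text to one segment."""
    st = cpg_island_stats(seq)
    return (st["length"] >= min_len
            and st["gc_content"] >= min_gc
            and st["obs_exp"] > min_obs_exp)
```

For example, a 200-base run of alternating C and G satisfies all three thresholds, while an A/T-only segment of the same length does not.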
Approximately 40% of the promoter region (region of the gene which controls its transcription or activation) of mammalian genes has associated CpG islands and three-quarters of these promoter-regions have high CpG concentrations. Overall in most CpG sites scattered throughout the DNA, the cytosine nucleotide is methylated. In contrast, in the CpG sites located in the CpG islands of promoter regions of genes, the cytosine is unmethylated suggesting a role of the methylation status of cytosine in CpG Islands in gene transcriptional activity.
The methylation of cytosines associated with or located in a gene is classically associated with suppression of gene transcription. In some genes, however, increased methylation has the opposite effect and results in activation or increased transcription of a gene. One potential mechanism explaining the latter phenomenon is that methylation of cytosine could potentially inhibit the binding of gene suppressor elements thus releasing the gene from inhibition. Epigenetic modification, including DNA methylation, is the mechanism by which cells that contain identical DNA and genes experience the activation of different genes and result in the differentiation into unique tissues e.g. heart or intestines.
The present disclosure describes the use of epigenomic and Artificial Intelligence analytic techniques for accurate diagnosis or prediction of CHD, including CHD of prenatal and/or pediatric subjects, based on detecting cytosine methylation of nucleic acids of subjects. Methylation profiling was performed using Illumina Infinium arrays with over 850,000 methylation markers according to the manufacturer's instructions. Methylation levels of CpG sites across the genome were examined in 12 CHD cases and compared to 26 unaffected healthy matched controls. Pathway analysis was performed using Ingenuity Pathway Analysis to elucidate the mechanism of the disorder. In addition, the diagnostic accuracy of epigenomic markers for the detection of CHD was determined. The area under the receiver operating characteristic curve (AUC), with 95% CI and FDR p-values, was calculated for the detection of CHD.
Several different Artificial Intelligence (AI) techniques including Deep Learning (DL), the newest form of AI, were used to predict CHD using i) epigenetic i.e. DNA methylation markers and ii) clinical and demographic markers.
In embodiments, the present disclosure describes a method for diagnosing CHD based on measurement of frequency or percentage methylation of cytosine nucleotides in various identified loci in a nucleic acid sample. In embodiments, the nucleic acid sample can be obtained from a biological sample of a patient in need thereof. The method includes obtaining a biological sample from a patient; extracting nucleic acid from the sample; assaying the sample to determine the percentage methylation of cytosine at loci throughout the genome; comparing the cytosine methylation level of the patient to a control; and calculating the individual risk of being diagnosed with CHD based on the cytosine methylation level at different sites throughout the genome. The control can be one or more characterized or known cases and/or a characterized or known group.
The methods described herein include obtaining nucleic acid from biological samples of a subject. The subject is an individual or a patient who is in need of diagnosis or is experiencing symptoms of CHD. The subject can also be undergoing routine screening. Examples of subjects include an adult, a pediatric patient, an embryo, or a fetus. An “embryo” refers to the patient from the time of fertilization to the end of the eighth week of gestation. A “fetus” refers to the patient after the eighth week of gestation.
In embodiments, the patients could be adults and the control could be a well-characterized group of normal (healthy) people and/or a well-characterized population of CHD patients.
In embodiments, the patient could be a pediatric patient. The pediatric patient can be less than about 19 years old, about 15 to 19 years old, less than about 15 years old, about 10 to 15 years old, less than 10 years old (childhood), about 5 to 10 years old, less than about 4 years old, about 1 to 4 years old, less than one year old (infant), or 28 days or less after birth (newborn or neonatal period). The patient can be an in utero patient, for example, an embryo or a fetus. When the patient is an embryo or fetus, the DNA can be obtained from a biological sample from the mother, i.e. the pregnant woman carrying the embryo or fetus. The biological sample can be obtained from a pregnant woman in her first trimester, second trimester, or third trimester.
The control for pediatric patients could be a well-characterized group of normal (healthy) children of less than about 19 years old and/or a well-characterized population of CHD pediatric patients. Likewise, the control for the in utero patient could be a well-characterized group of normal (healthy) in utero patients and/or a well-characterized population of CHD in utero patients.
The well-characterized group of normal people (including adult, pediatric, and in utero patients) or CHD patients may include one or more normal people or CHD patients or may include a population of normal people or CHD patients.
The biological sample can be a body fluid such as blood, plasma, serum, urine, saliva, sputum, sweat, breath condensate, tears, genital secretion including cervical secretion, amniotic fluid, and umbilical cord blood obtained at birth. The biological sample can be a cervical swab for cell-free nucleic acid or exfoliated trophoblast cells, skin, hair, follicles/roots, and mucous membranes (cheek aka buccal scrapings or scrapings from the tongue). The biological sample can also include any internal body tissue of the patient, such as any tissue samples obtained from the patient including placental tissue from the newborn period or during fetal life. The placental tissue from an embryo or fetus can be obtained by placental biopsy or chorionic villus sampling (CVS). In embodiments, the biological sample can also include specimen from CVS. In embodiments, biological samples from a mother can include maternal blood, placenta, amniotic fluid, other body fluids during pregnancy, or other maternal body fluids.
Cells and nucleic acid from any biological samples which contain DNA can be used in the methods described herein for diagnosing and predicting CHD. Samples used for testing can be obtained from living or dead tissue and also archeological specimens containing cells or tissues.
The nucleic acid used in the method described herein can be obtained from cells. In embodiments, the nucleic acid includes fetal nucleic acid obtained directly from the fetus or the embryo, such as from the placenta or amniotic fluid by amniocentesis, or obtained from maternal body fluids or placental tissue, circulating fetal cells harvested from the maternal circulation, exfoliated placental cells, or cfF DNA in maternal circulation. In embodiments, the nucleic acid is obtained from amniotic fluid, fetal blood, or cord blood obtained during fetal life or at birth. In embodiments, the nucleic acid is DNA or RNA.
Since all cells, with few exceptions (mature red blood cells and mature platelets), contain nuclei and therefore DNA, the method described herein can be used to screen for CHD using DNA from any cells except for the two named above. Thus, any biological or tissue sample containing cells that contain nucleic acid such as DNA can be used in the method described herein. In addition, cell-free DNA released from cells that have been destroyed and which can be retrieved from body fluids can be used for such screening.
Cell-Free DNA (cf DNA). The nucleic acid can be DNA existing in the form of cf DNA. Cell-free DNA refers to DNA that has been released from cells as a result of natural cell death/turnover or as a result of disease processes. The cf DNA is released into the circulation and rapidly broken down into DNA fragments and can ultimately end up in other body fluids, such as urine. The techniques for the harvesting of cf DNA from the blood and other body fluids are well known in the art (Li Y et al. Size separation of circulatory DNA in maternal plasma permits ready detection of fetal DNA polymorphisms. Clin Chem 2004; 50:1002-1011; Zimmerman B et al. Noninvasive prenatal aneuploidy testing of chromosomes 13, 18, 21, X, and Y, using targeted sequencing of polymorphic loci. Prenat Diagn 2012; 32:1233-41).
The cell-free DNA separation technique that was used coats and stabilizes maternal leukocytes, preventing their breakdown and the release of, and contamination by, further maternally derived leukocyte DNA. To further enhance and target analysis of placental/fetal cf DNA rather than cf DNA originating from other maternal tissues, targeting techniques that can identify the tissue source of the cf DNA can be performed. An example is the study of Lehmann-Werman R et al. (Lehmann-Werman et al. 2016). Using existing methylome databases, they identified tissue-specific methylation patterns of cf DNA. Cell-free DNA was obtained from blood donors and its methylation was compared to the methylome databases; they were thus able to determine the tissue of origin of circulating cf DNA fragments, e.g. pancreatic β-cells and oligodendrocytes from the brain.
5hmC level is tissue-specific, beyond its reported correlation with gene expression levels. In addition to the gene expression information provided by the 5hmC level in cf DNA, information on the tissue of origin of cf DNA opens up the ability to obtain DNA and epigenetic data from different organs based on a blood test. Studies (Nester et al. 2012) indicate that the level of 5hmC varies significantly between tissues. The levels of 5hmC are very low in the DNA of blood cells while comparatively high in placental tissue, and even higher in brain. Thus, measuring 5hmC in cf fetal DNA (of placental origin) would greatly redress the issue of contamination by maternal leukocyte cf DNA that could be a concern when only 5mC levels are measured in cfF DNA. This is so because the levels of 5hmC in leukocytes are so low compared to cfF DNA derived from the placenta, the object of interest in this CHD proposal. Overall, therefore, using 5mC and 5hmC measurements of cf DNA, targeted epigenomic analysis of placental tissue can be performed for the detection of, and to provide further mechanistic information on, CHD and other pregnancy disorders in which the placenta is affected.
Methylation Assays. Several quantitative methylation assays are available. These include COBRA™, which uses methylation-sensitive restriction endonuclease, gel electrophoresis, and detection based on labeled hybridization probes. Another available technique is Methylation Specific PCR (MSP) for amplification of DNA segments of interest. This is performed after sodium bisulfite conversion of cytosine, using methylation-sensitive probes. MethyLight™ is a quantitative methylation assay that uses fluorescence-based PCR. Another method used is the Quantitative Methylation (QM™) assay, which combines PCR amplification with fluorescent probes designed to bind to putative methylation sites. Ms-SNuPE™ is a quantitative technique for determining differences in methylation levels in CpG sites. As with other techniques, bisulfite treatment is first performed, leading to the conversion of unmethylated cytosine to uracil while methylcytosine is unaffected. PCR primers specific for bisulfite-converted DNA are used to amplify the target sequence of interest. The amplified PCR product is isolated and used to quantitate the methylation status of the CpG site of interest. The preferred method of measurement of cytosine methylation is the Illumina method.
Illumina Method. For the DNA methylation assay, the Illumina Infinium® Human Methylation 450 BeadChip or Illumina Infinium MethylationEPIC BeadChip assay can be used for quantitative methylation profiling. Briefly, nucleic acid, for example genomic DNA, is obtained. Using techniques widely known in the trade, the nucleic acid is isolated using commercial kits. Proteins and other contaminants are removed from the nucleic acid using proteinase K. The nucleic acid is removed from the solution using available methods such as organic extraction, salting out, or binding the DNA to a solid phase support.
Methylation Analysis. Illumina's Infinium Human Methylation 450 BeadChip system or Illumina Infinium MethylationEPIC BeadChip arrays can be used for genome-wide methylation analysis. Nucleic acid, such as DNA (500 ng), is subjected to bisulfite conversion to deaminate unmethylated cytosines to uracil with the EZ DNA Methylation-Gold Kit or EZ-96 Methylation Kit (Zymo Research) using the standard protocol for the Infinium assay. The DNA is enzymatically fragmented and hybridized to the Illumina BeadChips. BeadChips contain locus-specific oligomers in pairs, one specific for the methylated cytosine locus and the other for the unmethylated locus. A single base extension is performed to incorporate a biotin-labeled ddNTP. After fluorescent staining and washing, the BeadChip is scanned and the methylation status of each locus is determined using BeadStudio software (Illumina). Experimental quality is assessed using the Controls Dashboard, which has sample-dependent and sample-independent controls for target removal, staining, hybridization, extension, bisulfite conversion, specificity, negative control, and non-polymorphic control. The methylation status is the ratio of the methylated probe signal relative to the sum of the methylated and unmethylated probe signals. The resulting ratio ranges from unmethylated (0) to fully methylated (1). Differentially methylated sites are determined using the Illumina Custom Model and filtered according to p-value using 0.05 as a cutoff.
Bisulfite Conversion. As described in the Infinium® Assay Methylation Protocol Guide, nucleic acid is treated with sodium bisulfite, which converts unmethylated cytosine to uracil, while the methylated cytosine remains unchanged. The bisulfite-converted nucleic acid is then denatured and neutralized. The denatured nucleic acid is then amplified. Bisulfite-based analysis, the current technique for differentiating methylated from unmethylated cytosine, does not distinguish 5mC from 5hmC. Newer techniques, including but not limited to thin-layer chromatography assay (Kriaucionis et al. 2009), chemical tagging of 5hmC (Song et al. 2011), immunoprecipitation (Nester et al., 2012), and more recently commercially available 5hmC whole exome and even whole-genome sequencing techniques, can be used to provide detailed information on epigenetic changes in cfF DNA.
In embodiments, using the Illumina Infinium Assays for whole-genome (using genomic DNA) methylation studies, significant differences in the frequency (level or percentage) of methylation of specific cytosine nucleotides associated with particular genes were demonstrated in the CHD group when compared to a normal group. The differences in cytosine methylation levels are highly significant and of sufficient magnitude to accurately distinguish the CHD from the normal group. Thus, the methods described herein can be used as a test to screen for CHD cases among a mixed population with CHD and normal cases.
The whole-genome amplification process increases the amount of DNA by up to several thousand-fold. The next step uses enzymatic means to fragment the DNA. The fragmented DNA is next precipitated using isopropanol and separated by centrifugation. The separated DNA is next suspended in a hybridization buffer. The fragmented DNA is then hybridized to beads that have been covalently linked to 50mer nucleotide segments specific to the locus of the cytosine nucleotide of interest in the genome. There are over 500,000 bead types in total, specifically designed to anneal to the locus where the particular cytosine is located. The beads are bound to silicon-based arrays. There are two bead types designed for each locus. One bead type represents a probe that is designed to match the methylated locus, at which the cytosine nucleotide will remain unchanged. The other bead type corresponds to an initially unmethylated cytosine, which after bisulfite treatment is converted to a thymine nucleotide. Unhybridized DNA (not annealed to the beads) is washed away, leaving only DNA segments bound to the appropriate bead and containing the cytosine of interest. The bead-bound oligomer, after annealing to the corresponding patient DNA sequence, then undergoes single base extension with fluorescently labeled nucleotide, using the ‘overhang’ beyond the cytosine of interest in the patient DNA sequence as the template for extension.
If the cytosine of interest is unmethylated then it will match perfectly with the unmethylated or “U” bead probe. This enables single base extension with fluorescently labeled nucleotide probes and generates fluorescent signals for that bead probe that can be read in an automated fashion. If the cytosine is methylated, a single base mismatch will occur with the “U” bead probe oligomer. No further nucleotide extension on the bead oligomer then occurs, thus preventing incorporation of the fluorescently tagged nucleotides on the bead. This will lead to a low fluorescent signal from the “U” bead. The reverse will happen on the “M” or methylated bead probe.
A laser is used to stimulate the fluorophore bound to the single base used for the sequence extension. The level of methylation at each cytosine locus is determined by the intensity of the fluorescence from the methylated compared to the unmethylated bead. Cytosine methylation level is expressed as “β”, which is the ratio of the methylated bead probe signal to total signal intensity at that cytosine locus. These techniques for determining cytosine methylation have been previously described and are widely available for commercial use.
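The β calculation described above can be sketched as follows. This is a minimal illustration of the stated ratio (methylated signal divided by total signal), not the vendor software. The function name and the optional `offset` parameter are assumptions; some analysis pipelines add a small constant to the denominator to stabilize β at low intensities, and with the default of 0 the function reduces to the ratio given in the text.

```python
def beta_value(methylated: float, unmethylated: float, offset: float = 0.0) -> float:
    """Return beta = M / (M + U + offset) for one cytosine locus.

    methylated / unmethylated are the fluorescence intensities of the
    "M" and "U" bead probes. offset is a hypothetical stabilizing
    constant (an assumption, not taken from the text); 0 gives the
    plain ratio described in the text.
    """
    if methylated < 0 or unmethylated < 0 or offset < 0:
        raise ValueError("intensities and offset must be non-negative")
    total = methylated + unmethylated + offset
    if total == 0:
        raise ValueError("total signal is zero; beta is undefined")
    return methylated / total


# Example intensities for three loci (illustrative numbers only).
signals = {"locus_a": (300.0, 100.0), "locus_b": (0.0, 500.0), "locus_c": (450.0, 50.0)}
betas = {name: beta_value(m, u) for name, (m, u) in signals.items()}
```

A β of 0 corresponds to an unmethylated locus and a β of 1 to a fully methylated locus, so β multiplied by 100 gives the percentage methylation used in the threshold analysis.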
The present disclosure describes the use of a commercially available methylation technique to cover up to 99% of RefSeq genes, involving approximately 16,000 genes and 450,000 cytosine nucleotides down to the single nucleotide level, throughout the genome (Infinium Human Methylation 450 BeadChip Kit or Infinium MethylationEPIC BeadChip). The frequency of cytosine methylation at single nucleotides in a group of CHD cases compared to controls is used to estimate the risk or probability of being diagnosed with CHD. The cytosine nucleotides analyzed using this technique included cytosines within CpG islands and those at further distances outside of the CpG islands, i.e. located in “CpG shores” and “CpG shelves” and, even more distantly located from the island, so-called “CpG seas”.
The cytosines evaluated as described herein include but are not limited to cytosines in CpG islands located in the promoter regions of the genes. Other areas targeted and measured include the so-called CpG island ‘shores’, located up to 2000 base pairs distant from CpG islands, and ‘shelves’, which is the designation for DNA regions flanking shores. Even more distant areas from the CpG islands, so-called “seas”, were analyzed for cytosine methylation differences. The extragenic cytosine loci, located outside of known genes (although they could potentially maintain long-distance control of unspecified genes), also detected CHD with moderate, good, and excellent accuracy as indicated.
Identification of Specific Cytosine Nucleotides. Reliable identification of specific cytosine loci distributed throughout the genome has been detailed by Illumina in the document: “CpG Loci Identification. A guide to Illumina's method for unambiguous CpG loci identification and tracking for the GoldenGate® and Infinium™ assays for Methylation.” A brief summary follows. Illumina has developed a unique CpG locus identifier that designates cytosine loci based on the actual or contextual sequence of nucleotides in which the cytosine is located. It uses a strategy similar to that used for NCBI's refSNP IDs (rs #) and is based on the sequence flanking the cytosine of interest. Thus, a unique CpG locus cluster ID number is assigned to each of the cytosines undergoing evaluation. The system is reported to be consistent and will not be affected by changes in public databases and genome assemblies. Flanking sequences of 60 bases 5′ and 3′ to the CG locus (i.e. a sequence of 122 bases in total) are used to identify the locus. Thus, a unique “CpG cluster number” or cg # is assigned to the sequence of 122 bp which contains the CpG of interest. The cg # is based on Build 37 of the human genome (NCBI37). Accordingly, only if the 122 bp in the CpG cluster are identical is there a risk of a locus being assigned the same number and being located in more than one position in the genome. Three separate criteria are utilized to track an individual CpG locus based on this unique ID system: chromosome number, genomic coordinate, and genome build. The lesser of the two coordinates “C” or “G” in CpG is used in the unique CG loci identification. The CG locus is also designated in relation to the first ‘unambiguous’ pair of nucleotides containing either an ‘A’ (adenine) or a ‘T’ (thymine). If one of these nucleotides is 5′ to the CG then the arrangement is designated TOP and if such a nucleotide is 3′ it is designated BOT.
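The 122-base context described above (60 bases 5′, the CG dinucleotide, and 60 bases 3′) can be illustrated with a short sketch. This only extracts the flanking window from a sequence held in memory; it does not reproduce the assignment of cg # identifiers or the TOP/BOT designation. The function name and the 0-based coordinate convention are assumptions made for the example.

```python
def cpg_context(seq: str, c_index: int, flank: int = 60) -> str:
    """Return the flank + CG + flank window around a CpG.

    c_index is the 0-based position of the cytosine (the lesser of the
    two CpG coordinates). With flank=60 the window is 122 bases long.
    """
    s = seq.upper()
    if c_index < 0 or s[c_index:c_index + 2] != "CG":
        raise ValueError("c_index does not point at a CG dinucleotide")
    start = c_index - flank
    end = c_index + 2 + flank
    if start < 0 or end > len(s):
        raise ValueError("not enough flanking sequence on one side")
    return s[start:end]


# Illustrative sequence: 60 A's, the CpG of interest, then 60 T's.
example = "A" * 60 + "CG" + "T" * 60
window = cpg_context(example, 60)
```

Two loci would share a window only if all 122 bases were identical, which is the situation the text identifies as the sole risk of an ambiguous identifier.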
In addition, the forward or reverse DNA strand is indicated as being the location of the cytosine being evaluated. The assumption is made that methylation status of cytosine bases within the specific chromosome region is synchronized.
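By way of non-limiting illustration, the 122-base window and the TOP/BOT designation summarized above can be sketched in Python as follows. This is a minimal sketch and not Illumina's implementation: the function names are hypothetical, and it is assumed that an "unambiguous" pair is a pair of bases equidistant from the CG, one on each side, in which exactly one base is an A or a T.

```python
def cpg_window(sequence, c_index, flank=60):
    """Return flank + CG + flank bases (122 bases when flank is 60).

    c_index is the 0-based position of the C, i.e. the lesser of the
    two coordinates of the CG dinucleotide.
    """
    sequence = sequence.upper()
    if sequence[c_index:c_index + 2] != "CG":
        raise ValueError("c_index must point at the C of a CG dinucleotide")
    start, end = c_index - flank, c_index + 2 + flank
    if start < 0 or end > len(sequence):
        raise ValueError("not enough flanking sequence")
    return sequence[start:end]


def top_or_bot(window, flank=60):
    """Walk outward from the CG until the first unambiguous pair is found.

    Returns "TOP" if the A/T of that pair lies 5' of the CG, "BOT" if it
    lies 3' of the CG, and None if no unambiguous pair exists in the window.
    """
    five_prime = window[:flank][::-1]   # bases 5' of the CG, nearest first
    three_prime = window[flank + 2:]    # bases 3' of the CG, nearest first
    for left, right in zip(five_prime, three_prime):
        left_is_at = left in "AT"
        right_is_at = right in "AT"
        if left_is_at and not right_is_at:
            return "TOP"
        if right_is_at and not left_is_at:
            return "BOT"
    return None
```

A shorter flank can be passed for experimentation; the default of 60 bases on each side reproduces the 122-base window described above.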
Cytosine Methylation for Diagnosing CHD Using ROC Curve. To determine the accuracy of the methylation level of a particular cytosine locus for CHD prediction, different threshold levels of methylation at the site (e.g. ≥10%, ≥20%, ≥30%, ≥40%, etc.) were used to calculate sensitivity and specificity for CHD diagnosis. Thus, for example, using 10% methylation as the threshold at a particular cg locus, cases with methylation levels above this threshold would be considered to have a positive test and those with levels at or below this threshold would be interpreted as having a negative methylation test. The percentage of CHD cases with a positive test (in this example, >10% methylation at this particular cytosine locus) would be equal to the sensitivity of the test. The percentage of normal (non-CHD) cases with cytosine methylation levels of ≤10% at this locus would be the specificity of the test. The false positive rate is here defined as the proportion of normal cases with a (falsely) abnormal test result, and sensitivity is defined as the proportion of CHD cases with a (correctly) abnormal test result, e.g. a level of methylation >10% at this particular cg location. A series of threshold methylation values is evaluated, e.g. ≥ 1/10, ≥ 1/20, ≥ 1/30, etc., and used to generate a series of paired sensitivity and false positive values for each locus. A receiver operating characteristic (ROC) curve, which is a plot of data points with sensitivity values on the Y-axis and false positive rate on the X-axis, is then generated. This approach can be used to generate ROC curves for each individual cytosine locus that displays significant methylation differences between the CHD and control groups. In this instance, the ROCR package, version 3.4 (https://CRAN.R-project.org/package=ROCR), was used to generate the area under the ROC curves.
The ROC curve is a graph plotting sensitivity on the Y-axis against the false positive rate on the X-axis. Sensitivity is defined in this setting as the percentage of CHD cases with a positive test, i.e. abnormal cytosine methylation levels at a particular cytosine locus. The false positive rate is the percentage of normal (non-CHD) cases falsely found to have a positive test (i.e. abnormal cytosine methylation at the same locus). Specificity is defined as the percentage of normal (non-CHD) cases with normal methylation levels at the locus of interest, i.e. a negative test. The false positive rate can therefore be calculated as 100% − specificity when specificity is expressed as a percentage, or as 1 − specificity when it is expressed as a decimal.
The area under the ROC curve (AUC) indicates the accuracy of the test in distinguishing normal from abnormal cases. The AUC is the area between the ROC curve and the X-axis. The diagonal line drawn from the point of intersection of the X- and Y-axes at an angle of incline of 45° corresponds to a test with no discriminating ability (AUC=0.5). The higher the AUC, the greater the accuracy of the test in predicting the condition of interest. An AUC of 1.0 indicates a perfect test, which is positive (abnormal) in all cases with the disorder and negative in all normal cases (without the disorder). Methylation assay refers to an assay, a large number of which are commercially available, for determining the level of methylation at a particular cytosine in the genome. In this particular context, this approach can be used to distinguish the level of methylation in affected cases (CHD) from that in unaffected controls.
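By way of non-limiting illustration, the threshold-based construction of a ROC curve and its AUC for a single cytosine locus can be sketched in Python as follows. This is a minimal sketch and not the ROCR implementation: the function names are hypothetical, a sample is assumed to be test-positive when its methylation level is above the threshold (i.e. the locus is assumed to be hypermethylated in CHD), and the trapezoidal rule is assumed for the area.

```python
def roc_points(chd_levels, control_levels, thresholds):
    """Return sorted (false positive rate, sensitivity) pairs.

    One pair is produced per threshold; the end points (0, 0) and (1, 1),
    which belong to every ROC curve, are always included.
    """
    points = {(0.0, 0.0), (1.0, 1.0)}
    for t in thresholds:
        # Sensitivity: fraction of CHD cases that test positive.
        sensitivity = sum(v > t for v in chd_levels) / len(chd_levels)
        # False positive rate: fraction of controls that test positive.
        fpr = sum(v > t for v in control_levels) / len(control_levels)
        points.add((fpr, sensitivity))
    return sorted(points)


def auc(points):
    """Trapezoidal area between the ROC curve and the X-axis."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

With methylation percentages that separate the two groups completely, the sketch returns an AUC of 1.0, and with identical distributions in the two groups it returns 0.5, matching the interpretation given above.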
Logistic regression analysis can be used for calculation of sensitivity and specificity for the prediction of CHD based on methylation of cytosine loci.
Standard statistical testing can be performed using p-values to express the probability that a difference in cytosine methylation at a given locus between CHD and control DNA specimens at least as large as that observed would arise by chance alone. More stringent testing of statistical significance using the False Discovery Rate (FDR) for multiple comparisons was also performed. The FDR gives the probability that positive results were due to chance when multiple hypothesis testing is performed using multiple comparisons.
Statistical Analyses. The present disclosure describes a method for predicting, diagnosing, or detecting CHD in a subject, and/or calculating the risk of the subject being diagnosed with CHD or even a particular type of CHD. This calculation can be based on logistic regression analysis leading to the identification of the significant independent predictors among a number of possible predictors (e.g. methylation loci) known to be associated with CHD or with an increased risk of being diagnosed with CHD. Cytosine methylation levels at different loci can be used by themselves or in combination with other known risk predictors for CHD, such as prenatal exposure to toxins, coded as “yes” or “no” (e.g. alcohol or maternal smoking), maternal diabetes, and family history, each of which is known to be associated with an increased risk of CHD as described in this application and can be combined with methylation levels at a single locus or at multiple loci. For example, the probability of an individual being affected can be derived from the probability equation based on the logistic regression:
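By way of non-limiting illustration, an FDR adjustment of per-locus p-values can be sketched in Python as follows. The present disclosure does not name the FDR procedure that was used; the Benjamini-Hochberg step-up procedure is assumed here purely for illustration, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as p_values."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each q-value
    # is never larger than the q-value of any larger p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values
```

A locus would then be reported as significant when its q-value falls below the chosen cut-off, for example the q-value <0.05 cut-off referred to elsewhere in this disclosure.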
$$P_{\mathrm{CHD}} = \frac{1}{1 + e^{-(\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n)}}$$
where ‘x’ refers to the magnitude or quantity of the particular predictor (e.g. methylation level at a particular locus) and “β” or β-coefficient refers to the magnitude of change in the log-odds of the outcome (e.g., CHD) for each unit change in the level of the particular predictor (x). The β values are derived from multivariable logistic regression analysis in a large population of affected and unaffected individuals. Values for x1, x2, x3, etc., representing in this instance the methylation percentage at different cytosine loci, would be derived from the individual being tested, while the β-values would be derived from the logistic regression analysis of the large reference population of affected (CHD) and unaffected cases mentioned above. Based on these values, an individual's probability of having a type of CHD can be quantitatively estimated. Probability thresholds are used to define individuals at high risk. For example, a probability of CHD of ≥ 1/100 may be used to define a high-risk individual, triggering further evaluation such as one or more of the following: echocardiograms, pulse oximetry measurements at birth, and the like, while individuals with a risk < 1/100 would require no further follow-up. The threshold used will be based on, among other factors, the diagnostic sensitivity (proportion of CHD cases correctly identified), the specificity (proportion of non-CHD cases correctly identified as normal), and the risk and cost of echocardiography and related interventions pursuant to the designation of an individual as “high risk” for CHD. Logistic regression analysis is well known as a method in disease screening for estimating an individual's risk of having a disorder (Royston P, Thompson S G. Model-based screening by risk with application in Down's syndrome. Stat Med 1992; 11:257-68).
Individual risk of CHD can also be calculated by using methylation percentages (reported as β-values) at the individual discriminating cytosine loci by themselves, or using different combinations of loci, based on the method of overlapping Gaussian distributions or multivariate Gaussian distributions (Wald N J, Cuckle H S, Densem J W et al. (1988) Maternal serum screening for Down's syndrome in early pregnancy. BMJ 297, 883-887), where the variable would be the methylation level or percentage methylation at a particular locus or at multiple loci. If methylation percentages or β-values are not normally distributed (i.e. non-Gaussian), a Gaussian distribution would be achieved, if necessary, by logarithmic transformation of these percentages.
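By way of non-limiting illustration, the probability calculation and the 1/100 risk threshold described above can be sketched in Python as follows. This is a minimal sketch: the standard logistic form with a negative exponent and no intercept term is assumed, the function names are hypothetical, and any β values supplied by a user would have to come from a fitted regression model rather than from this example.

```python
import math


def chd_probability(betas, xs):
    """Logistic probability 1 / (1 + exp(-(b1*x1 + b2*x2 + ... + bn*xn)))."""
    if len(betas) != len(xs):
        raise ValueError("one beta coefficient is needed per predictor")
    linear_predictor = sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def is_high_risk(probability, threshold=1 / 100):
    """Flag an individual whose estimated probability meets the threshold."""
    return probability >= threshold
```

An individual flagged by `is_high_risk` would be referred for further evaluation as described above, while the choice of threshold remains a matter of balancing sensitivity, specificity, and the cost of follow-up.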
As an example, two Gaussian distribution curves are derived for methylation at particular loci in the CHD group and the normal populations. Mean, standard deviation and the degree of overlap between the two curves are then calculated. The ratio of the heights of the distribution curves at a given level of methylation will give the likelihood ratio or factor by which the risk of having CHD is increased (or decreased) at a particular level of methylation at a given locus. The likelihood ratio (LR) value can be multiplied by the background risk of CHD in the general population and thus give an individual's risk of CHD based on methylation level at the CG site(s) chosen. Information on the background population risk of CHD in the newborn population is available from several sources (one such example is Hoffman J L et al Am Heart J 2004; 147:425-439). Similar information is available for prenatal and later postnatal life.
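By way of non-limiting illustration, the likelihood ratio calculation described above can be sketched in Python as follows. This is a minimal sketch with hypothetical function names; the means, standard deviations, and background risk would have to be taken from reference data. As a stated assumption, the sketch applies the likelihood ratio to the background odds and converts the result back to a probability, which for a small background risk is very close to multiplying the background risk by the likelihood ratio.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Height of a Gaussian distribution curve at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def likelihood_ratio(level, chd_mean, chd_sd, normal_mean, normal_sd):
    """Ratio of the CHD curve height to the normal curve height at level."""
    return (gaussian_pdf(level, chd_mean, chd_sd)
            / gaussian_pdf(level, normal_mean, normal_sd))


def individual_risk(background_risk, lr):
    """Combine the background risk with a likelihood ratio via the odds."""
    prior_odds = background_risk / (1.0 - background_risk)
    posterior_odds = prior_odds * lr
    return posterior_odds / (1.0 + posterior_odds)
```

When loci are combined, the single-locus curves would be replaced by multivariate Gaussian distributions as noted above; that extension is not shown in this sketch.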
Artificial Intelligence (AI). One or more AI algorithms can be used in combination with the methods described herein to improve the accuracy of predicting and/or diagnosing CHD. Representative examples of AI algorithms include Random Forest (RF), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Prediction Analysis for Microarrays (PAM), Generalized Linear Model (GLM), and deep learning (DL).
Random Forest (RF) is a supervised learning algorithm used for regression, classification, and other tasks. Multiple decision tree predictive models are randomly generated in the training phase, and the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees is generated as the output. In general, increasing the number of trees in the forest makes the result more stable and, up to a point, more accurate. The difference between the Random Forest algorithm and the decision tree algorithm is that in Random Forest, the processes of finding the root node and splitting the feature nodes are randomized. The decision tree is a decision support tool that uses a tree-like graph to show the possible consequences. If one inputs a training dataset with targets and features into the decision tree, it will formulate a set of rules. Overfitting is one critical problem that may make the results worse in decision trees, but with the Random Forest algorithm, if there are enough trees in the forest, the classifier is much less prone to overfitting. Another advantage is that the Random Forest classifier can handle missing values, and a further advantage is that the Random Forest classifier can be modeled for categorical values.
Support vector machine (SVM) is primarily a classifier method that performs classification tasks by constructing hyperplanes in a multidimensional space that separates cases of different class labels. SVM supports both regression and classification tasks and can handle multiple continuous and categorical variables. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p−1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. We choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum margin classifier; or equivalently, the perceptron of optimal stability.
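By way of non-limiting illustration, a linear classifier of the kind described above can be sketched in Python as follows. This is a minimal sketch of a soft-margin linear SVM trained by full-batch subgradient descent on the regularized hinge loss, which approximates the maximum-margin hyperplane; it is not the solver used to obtain the results reported herein, and the function names, learning rate, regularization strength, and number of epochs are assumptions chosen for illustration.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=500):
    """Fit w and b so that sign(w.x + b) separates labels +1 and -1.

    Minimizes lam/2 * |w|^2 plus the mean hinge loss max(0, 1 - y*(w.x + b)).
    """
    n, dim = len(points), len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:              # inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane x falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

In this sketch each data point would be a vector of methylation levels expressed as fractions between 0 and 1, with the label +1 for CHD and −1 for control.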
Linear Discriminant Analysis (LDA) is a classification method originally developed in 1936 by R. A. Fisher. It is simple, mathematically robust, and often produces models whose accuracy is as good as more complex methods. LDA is based upon the concept of searching for a linear combination of variables (predictors) that best separates two classes (targets). It is closely related to analysis of variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements.
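By way of non-limiting illustration, the search for a linear combination of predictors that best separates two classes can be sketched in Python as follows. This is a minimal sketch of Fisher's two-class discriminant restricted, for brevity, to two predictors (e.g. methylation percentages at two loci); the function names are hypothetical, and equal prior probabilities are assumed in placing the decision threshold midway between the class means.

```python
def mean_vector(rows):
    """Mean of each of the two predictors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter_matrix(rows, mu):
    """2 x 2 within-class scatter: sum of outer products of deviations."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_lda(class0, class1):
    """Return the discriminant direction w and the decision threshold."""
    mu0, mu1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter_matrix(class0, mu0), scatter_matrix(class1, mu1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    # w = Sw^-1 (mu1 - mu0), using the closed-form inverse of a 2 x 2 matrix.
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (sw[0][0] * diff[1] - sw[1][0] * diff[0]) / det]
    midpoint = [(mu0[j] + mu1[j]) / 2.0 for j in range(2)]
    threshold = w[0] * midpoint[0] + w[1] * midpoint[1]
    return w, threshold


def classify(w, threshold, x):
    """Return 1 (class1) if the projection of x exceeds the threshold, else 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0
```

A general implementation would solve the corresponding linear system for any number of predictors; the two-predictor restriction is only a simplification of this sketch.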
Prediction Analysis for Microarrays (PAM) is a statistical technique for class prediction from gene expression data using the nearest shrunken centroids. This method identifies the subsets of genes that best characterize each class.
Generalized Linear Models (GLMs) are a broad class of models that includes, for example, linear regression, ANOVA, Poisson regression, and log-linear models. GLMs have some limitations: the systematic component can contain only a linear predictor, and the responses must be independent.
Generally, classical machine learning techniques make predictions directly from a set of features that have been pre-specified by the user. However, representation learning techniques transform features into some intermediate representation prior to mapping them to final predictions. Deep Learning (DL) is a form of representation learning that uses multiple transformation steps to create very complex features. DL is widely applied in pattern recognition, image processing, computer vision, and recently in bioinformatics. DL is commonly implemented as feed-forward artificial neural networks (ANNs), which use more than one hidden layer (y) connecting the input layer (x) and the output layer (z) via a weight matrix (W). The weight matrix W that minimizes the difference between the input layer (x) and the output layer (z) is considered the best one and is chosen by the system to get the best results.
Types of CHDs and Prevention and Treatment of CHDs. The methods described herein can be used to diagnose, detect, or predict one or more heart defects in a subject. There are various types of heart defects, which are often grouped by the part of the heart that is affected. The defects could be of the heart valves, the septum, the interior valves of the heart, or the ventricles. Valve defects include, for example, defects of the aortic and pulmonary valves. Defects in the septum include problems with the wall that separates the right-hand and left-hand chambers of the heart. When the septum fails to fully close during development of the heart, small holes result in the wall separating the right-hand and left-hand chambers. Some defects are due to the valves that separate the top and bottom chambers of the heart failing to form properly. Some types of heart defects include a combination of different defects or relate to the way the heart looped while it was developing.
Examples of some CHDs include aortic valve stenosis (AVS), hypoplastic left heart syndrome (HLHS), ventricular septal defect (VSD) including VSD with atrial septal defect, Tetralogy of Fallot (TOF), coarctation of the aorta (Coarct), atrial septal defect (ASD), pulmonary stenosis (PS) including pulmonary valve stenosis and pulmonary artery valve stenosis, pulmonary artery atresia including those with pulmonary valve stenosis, truncus arteriosus, double aortic arch, and bicuspid A-V valve including those with dilated main pulmonary artery. Patients diagnosed with one or more of these types of CHDs require surgery to prevent severe complications or death.
The methods described herein provide accurate and early detection and prediction of one or more heart defects. Early prenatal diagnosis of CHD allows early evaluation and optimal treatment at birth and reduces the death rate (Holland B J et al. Ultrasound Obstet Gynecol 2015; 45:631-638) and morbidities compared to cases in which the diagnosis is made later after birth. In addition, early postnatal diagnosis of CHD is associated with improved survival compared with late diagnosis in critical CHD cases (Eckersley L, Sadler L, et al. Arch Dis Child 2015; 0:1-5). Thus, the methods described herein promote accurate prenatal diagnosis, which facilitates earlier evaluation, for example during the newborn period, earlier treatment, and an improved survival rate in CHD cases.
The cytosine methylation described herein refers to the chemical addition of a methyl group, i.e. a single carbon unit, to the cytosine nucleotide. An important dietary source of the carbon atom used in cytosine methylation is folic acid. Given Applicant's findings demonstrating the importance of cytosine methylation in the pathogenesis of CHD, it is reasonable to expect that dietary folic acid supplementation would reduce the risk of CHD. Currently, folic acid fortification of grains and bread, and direct supplementation for consumption by the entire population including pregnant women, has been a standard of care for the prevention of neural tube defects. This presents a ‘natural’ experiment by which to judge the value of folate supplementation for the prevention of CHD. Studies in China (Mao B. et al. 2017), the U.S.A. (Botto et al. 2003), Europe (Czeizel et al. 1998), and Canada (Ionescu-Ittu et al. 2009) showed a beneficial effect of folic acid supplementation in reducing CHD. It should be noted that a significant percentage of pregnant women might not respond adequately to folate supplementation. This is because a significant percentage of females in some populations have a mutation in the methylenetetrahydrofolate reductase (MTHFR) gene, which codes for the similarly named enzyme responsible for folate metabolism, the end result of which is the generation of the methyl carbon involved in cytosine methylation. The result of the MTHFR mutation is therefore to impair cytosine methylation. The frequency of this mutation can be as high as 20% in some populations. Also, when the mutation is present, enzyme activity is reduced by as much as 30% (Botto et al. 2000). Thus, MTHFR mutation would be expected to blunt the effectiveness of folic acid supplementation in such populations.
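By way of non-limiting illustration, the passage of an input through several layers of a feed-forward network can be sketched in Python as follows. This is a minimal sketch of the forward pass only: the function names, the sigmoid activation, and the use of a separate weight matrix and bias vector for each layer are assumptions, and the training step that selects the weights is not shown.

```python
import math


def sigmoid(v):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-v))


def dense(inputs, weights, biases):
    """One fully connected layer: sigmoid of (weights times inputs plus biases).

    weights holds one row of input weights per output unit.
    """
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(x, layers):
    """Propagate the input x through a list of (weights, biases) layers."""
    activation = x
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation
```

Each element of `layers` corresponds to one transformation step, so a network with more than one hidden layer is represented by a list with three or more entries.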
Alternate sources of methyl group that are unaffected by the MTHFR gene mutation fortunately exist. These include choline and betaine and exist in dietary sources such as broccoli, spinach, beets, liver, and other foods. Based on Applicant's findings of the importance of DNA methylation in CHD, population fortification and individual supplementation programs for choline and betaine could be evaluated. Further, current evidence indicates that less than one tenth of the U.S.A. population including pregnant women has adequate choline consumption (Zeisel et al. 2009). The risk of deficiency in pregnancy is amplified by the fact that this is a period of increased choline demand.
Laboratory evidence exists for the importance of betaine, an alternate source of single carbon for cytosine methylation, for the prevention of CHD (Karunamuni et al. 2017). Prenatal alcohol exposure is a significant risk factor for CHD. It has been estimated that 10% of pregnant women in the U.S.A. drink alcohol and that 3% have high levels of consumption (Tan et al. 2015). Prenatal alcohol intake has been shown to interfere with single carbon metabolism and DNA methylation in mice (Liu et al. 2009). In the study by Karunamuni et al. (2017), supplementation with betaine in pregnant mice exposed to alcohol reduced the frequency of CHD, with beneficial effects on the structural changes to the heart and great vessels that result from alcohol exposure. Importantly, alcohol caused a reduction in the 5mC nucleotide levels in cardiac neural crest cells of the pups affected with CHD. Cardiac neural crest cells are a specialized group of cells that are embryologically critical for the development of important heart structures such as the interventricular septum, the aorta, the pulmonary artery, and other vessels. The 5mC levels of these cardiac neural crest cells were returned to normal with betaine supplementation of pregnant mice that were exposed to alcohol. The preceding information confirms the importance of methylation changes in heart development. However, it does not provide any information on the use of methylation levels in cf DNA for CHD detection. The method described herein is a minimally invasive blood test to evaluate the mechanisms of CHD. The test can be performed close to the time of early development of the heart in the first trimester. Cell-free fetal DNA is known to be deported into the maternal circulation from early in the first trimester.
Knowledge of the methylation status of a developing fetus can potentially serve as a basis for supplementation with folate, choline, or betaine in individual patients to help prevent or mitigate the development or severity of CHD.
From a population perspective, the knowledge generated from Applicant's findings, i.e. the use of cfF DNA from a developing fetus showing evidence of profound changes in cytosine methylation in important cardiac genes in CHD cases, could form the basis of a policy of population supplementation with choline and betaine. This would be particularly important for populations with a high rate of MTHFR mutation, which renders folate supplementation less effective. Thus, Applicant's findings can be used as a basis for important prophylactic and even therapeutic intervention.
Once a patient is diagnosed with CHD, medication can be used to keep the heartbeat regular. Immediate treatment of CHD patients includes providing a sufficient oxygen level to the patient until repair of the heart can be performed. Surgery, such as open-heart surgery, can be performed on patients to repair any defect of the heart. Other methods of treatment, depending on the CHD, include antibiotics, cardiac catheterization procedures, and heart transplant. Medications for CHD can include angiotensin-converting enzyme (ACE) inhibitors, angiotensin II receptor blockers (ARBs), beta blockers, diuretics, antihypertensives, and others.
In embodiments, the methods described herein enable prophylaxis against development of CHD in a future pregnancy. As an example, the pregnant woman can be supplemented with folic acid or folates.
Tables of Genes and Genomic Loci (CHD Markers). The present disclosure reports a strong association between CHD and cytosine methylation status at a large number of cytosine sites throughout the genome, using stringent False Discovery Rate (FDR) analysis with q-values <0.05 and with many q-values as low as <1×10^−30, depending on the particular cytosine locus being considered. A total of 12 cases of CHD and 26 normal controls underwent epigenomic analysis. Significant differences in cytosine methylation patterns at multiple loci throughout the DNA were found in all CHD cases tested compared to normal controls. The genomic loci described herein are located in or related to known genes. These findings are consistent with the altered expression of multiple genes in CHD cases compared to controls.
Tables 1 to 6 provide genomic loci that can be used individually to predict, detect, or diagnose CHD in patients. The genomic loci are provided underneath each table of Tables 2 to 6. One or more of the genomic loci in Tables 1 to 6 can be selected for predicting or diagnosing CHD in patients. In embodiments, two or more of the loci can be selected from the loci in Table 1, 2, 3, 4, 5, and/or 6.
A total of 5918 CpG genomic loci encompassing 4976 genes (FDR p-value ≤0.01) were identified. Table 1 provides the top 1000 genomic loci obtained by genome-wide methylation profiling performed using cfF DNA. One or more, two or more, up to and including all 1000 of the genomic loci in Table 1 can be selected for predicting, detecting, or diagnosing CHD in a patient. In embodiments, one or more, two or more, up to and including 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1000 genomic loci disclosed in Table 1 can be selected for predicting CHD. In embodiments, the genomic loci have an AUC (with 95% CI) ≥0.70, 0.75, 0.80, 0.85, 0.86, 0.87, 0.88, 0.89, 0.90, 0.91, 0.92, 0.93, or 0.94. Of the 1000 genomic loci, 130 CpGs were hypomethylated and 870 were hypermethylated in association with CHD. 53 hypomethylated and 486 hypermethylated markers were differentially methylated with a methylation difference of ≥10%. The three genomic loci with an absolute methylation difference above 20% are cg06301252 (PTPRN2) with 33.62%, cg02807450 (MTMR2) with −21.15%, and cg12900404 (DOCK10) with −20.13%. 126 CpG loci individually showed an AUC ≥0.80, indicating very good to excellent predictive accuracy for the disease. In embodiments, the genomic loci for detecting, predicting, and/or diagnosing CHD include cg06301252, cg02807450, and cg12900404.
AUC integrates sensitivity and specificity values and gives a more precise indication of the accuracy of the test. AUC (with 95% CI) indicates an AUC with a statistically significant 95% confidence interval. An AUC ≥0.70 indicates a clinically useful test.
Tables 2-6 provide the genomic loci obtained by AI analysis using 6 different platforms: SVM, GLM, PAM, RF, LDA, and DL. Although the loci are provided under each of the tables, they belong to the numbered table summarizing the AUC, sensitivity, and specificity for each algorithm. In embodiments, the genomic loci for predicting, detecting, and diagnosing include the one to five loci for each of the 6 algorithms listed in Tables 2-6. The top 5 predictive markers for each model are provided under each table in descending order.
In embodiments, the genomic loci are selected from the algorithms having an AUC (with 95% CI), ≥0.8800, 0.8900, 0.9000, 0.9100, 0.9200, 0.9300, 0.9400, 0.9500, 0.9600, or 0.9700. In embodiments, the genomic loci are selected from algorithms with a sensitivity and/or specificity of ≥0.8700, 0.8800, 0.8900, 0.9000, or 0.9100. In embodiments, the genomic loci are selected from the one to five loci of RF, SVM, or DL of Table 2.
Table 3 shows that comparable predictive performance was achieved when demographic markers were considered along with CpG loci. Table 4 shows that high performance was achieved when only markers meeting stringent GWAS thresholds were considered. Table 5 shows that high performance was achieved when demographic markers and CpG loci meeting stringent GWAS thresholds were considered. Table 6 shows that high predictive accuracies are seen when only markers showing a high level of methylation change, for example 1.5-fold or greater, are used.
Ranges described throughout the application include the specified range, the sub-ranges within the specified range, the individual numbers within the range, and the endpoints of the range. For example, description of a range such as from one or more up to 1000 includes subranges such as from one or more to 500 or more, from 10 or more to 20 or more, from one or more to five or more, as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, 10, 20, 100, and 500. Moreover, as a further example, the description of a range of ≥0.70 would include all the individual numbers from 0.70 to 1.00 and including 0.70 and 1.0.
The results presented herein confirm that based on the differences in the level of methylation of the cytosine sites between CHD and normal cases throughout the whole human genome, the predisposition to or risk of having a CHD can be determined.
The genomic loci reported enable targeted screening studies for the prediction and detection of CHD based on cytosine methylation throughout the genome. They also permit improved understanding of the mechanism of development of CHD, for example by evaluating the cytosine methylation data using gene ontology analysis. In embodiments, the genomic loci are used in many different combinations to predict, detect, or diagnose CHD in a subject. In embodiments, the genomic loci are used to determine or calculate the risk or predisposition of a subject to having a CHD at any time, prenatally or during any period of postnatal life of the subject.
SVM: cg04761177, cg21431091, cg01263077, cg09853933, cg27142059
GLM: cg24479965, cg01094213, cg22467129, cg24748945, cg01949461
PAM: cg27142059, cg09386284, cg16551159, cg04761177, cg01263077
RF: cg04761177, cg16551159, cg14957943, cg06978680, cg12592721
LDA: cg04761177, cg27142059, cg18073832, cg25731807, cg03790075
DL: cg04761177, cg21431091, cg01263077, cg09853933, cg27142059
SVM: cg04761177, cg21431091, cg01263077, cg09853933, cg27142059
GLM: cg24479965, cg01094213, cg22467129, cg24748945, cg01949461
PAM: cg27142059, cg09386284, cg16551159, cg04761177, cg01263077
RF: cg04761177, cg16551159, cg14957943, cg06978680, cg12592721
LDA: cg04761177, cg27142059, cg18073832, cg25731807, cg03790075
DL: cg04761177, cg21431091, cg01263077, cg09853933, cg27142059
SVM: cg04761177, cg15277677, cg03790075, cg04626875, cg11782260
GLM: cg13598434, cg05349624, cg10259004, cg04761177, cg11196182
PAM: cg18198743, cg08316054, cg02394812, cg00280345, cg08052226
RF: cg04761177, cg24412848, cg01637563, cg17720707, cg05349624
LDA: cg03790075, cg04761177, cg05349624, cg14809932, cg01637563
DL: cg13598434, cg05349624, cg10259004, cg04761177, cg11196182
SVM: cg04761177, cg15277677, cg03790075, cg04626875, cg11782260
GLM: cg13598434, cg05349624, cg10259004, cg04761177, cg11196182
PAM: cg18198743, cg08316054, cg02394812, cg00280345, cg08052226
RF: cg04761177, cg24412848, cg01637563, cg17720707, cg05349624
LDA: cg03790075, cg04761177, cg05349624, cg14809932, cg01637563
DL: cg13598434, cg05349624, cg10259004, cg04761177, cg11196182
SVM: cg09493833, cg27563174, cg22752533, cg05279901, cg03741571
GLM: cg19803352, cg24479965, cg07287606, cg22467129, cg24509810
PAM: cg27142059, cg13598434, cg14200609, cg09493833, cg20360734
RF: cg27142059, cg26078733, cg14200609, cg23273875, cg24509810
LDA: cg27142059, cg09493833, cg19803352, cg17485454, cg07287606
DL: cg27142059, cg26078733, cg14200609, cg23273875, cg24509810
Microarray. Differential methylation can be analyzed using a microarray system. Nucleic acids can be linked to chips, such as microchips. See, for example, U.S. Pat. Nos. 5,143,854; 6,087,112; 5,215,882; 5,707,807; 5,807,522; 5,958,342; 5,994,076; 6,004,755; 6,048,695; 6,060,240; 6,090,556; and 6,040,138. Binding to nucleic acids on microarrays can be detected by scanning the microarray with a variety of laser or charge coupled device (CCD)-based scanners, and extracting features with software packages, for example, Imagene (Biodiscovery, Hawthorne, Calif.), Feature Extraction Software (Agilent), Scanalyze (Eisen, M. 1999. SCANALYZE User Manual; Stanford Univ., Stanford, Calif. Ver 2.3.2.), or GenePix (Axon Instruments). A full panel of loci would include one or more genomic loci listed in Tables 1-6 that have been shown individually to be potentially clinically useful tests (AUC ≥0.70).
Kits. Kits for predicting and diagnosing CHD based on methylation of CpG loci on nucleic acids are described. The kits can include the components for extracting nucleic acids including DNA and RNA from the biological sample, the components of a microarray system, and/or for analysis of the differentially methylated genomic sites.
Biomarker detection of CHD as described herein can lead to early and accurate diagnosis and thus facilitate the management objectives outlined by the CDC. Given the evidence that a significant percentage, even a majority, of major CHD cases remain undiagnosed, accurate biomarkers are a critical and necessary complement to any effective treatment strategy.
Methods disclosed herein include predicting, detecting, or diagnosing CHD and/or calculating risk or disposition to developing CHD. The methods described herein can be used in the prevention and/or treatment (including mitigating or alleviating symptoms) of patients at an early stage to prevent death or the development of severe symptoms associated with CHD. Subjects or patients in need of (in need thereof) predicting, diagnosing, and/or treating are subjects that may have CHD and need to be diagnosed and treated.
As will be understood by one of ordinary skill in the art, each embodiment disclosed herein can comprise, consist essentially of, or consist of its particular stated element, step, ingredient, or component. Thus, the terms “include” or “including” should be interpreted to recite: “comprise, consist of, or consist essentially of.” The transition term “comprise” or “comprises” means includes, but is not limited to, and allows for the inclusion of unspecified elements, steps, ingredients, or components, even in major amounts. The transitional phrase “consisting of” excludes any element, step, ingredient, or component not specified. The transition phrase “consisting essentially of” limits the scope of the embodiment to the specified elements, steps, ingredients, or components and to those that do not materially affect the embodiment. Examples of the latter include steps that do not affect the detection, prediction, or diagnosis of CHD, or that do not affect the prevention or treatment of CHD in a patient.
In addition, unless otherwise indicated, numbers expressing quantities of ingredients, constituents, reaction conditions and so forth used in the specification and claims are to be understood as being modified by the term “about.” Accordingly, unless indicated to the contrary, the numerical parameters set forth in the specification and attached claims are approximations that may vary depending upon the desired properties sought to be obtained by the subject matter presented herein. At the very least, and not as an attempt to limit the application of the doctrine of equivalents to the scope of the claims, each numerical parameter should at least be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the subject matter presented herein are approximations, the numerical values set forth in the specific examples are reported as precisely as possible. Any numerical values, however, inherently contain certain errors necessarily resulting from the standard deviation found in their respective testing measurements.
When further clarity is required, the term “about” has the meaning reasonably ascribed to it by a person skilled in the art when used in conjunction with a stated numerical value or range, i.e. denoting somewhat more or somewhat less than the stated value or range, to within a range of ±20% of the stated value; ±15% of the stated value; ±10% of the stated value; ±5% of the stated value; ±4% of the stated value; ±3% of the stated value; ±2% of the stated value; ±1% of the stated value; or ±any percentage between 1% and 20% of the stated value.
The terms “a,” “an,” “the” and similar referents used in the context of describing the claimed subject matter (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context.
The following exemplary embodiments and examples illustrate exemplary methods provided herein. These exemplary embodiments and examples are not intended, nor are they to be construed, as limiting the scope of the disclosure. It will be clear that the methods can be practiced otherwise than as particularly described herein. Numerous modifications and variations are possible in view of the teachings herein and, therefore, are within the scope of the disclosure.
Exemplary Embodiments include but are not limited to:
1. The methods described herein include the use of nucleic acid obtained from a biological sample of a subject for the diagnosis, detection and/or prediction of CHD. The subject can be a fetus, embryo, newborn, infant, child, adolescent, or an adult. The subject can be a pregnant woman. The biological sample includes a tissue sample, including placental tissue, or a body fluid such as blood, plasma, serum, urine, saliva, sputum, sweat, tears, genital secretion including cervical secretion, amniotic fluid, and umbilical cord blood obtained at birth. In embodiments, the nucleic acid is DNA. The DNA can be cellular DNA or cell-free (cf) DNA. The cf DNA can be cf fetal (cfF) DNA.
2. The methods described herein include the use of DNA and cfF DNA to determine the epigenetic mechanism of CHD.
3. The epigenetic mechanism described herein includes all forms of epigenetic testing based on the use of DNA and cfF DNA such as DNA methylation changes and histone modification including but not limited to methylation, acetylation, sumoylation and phosphorylation. Histones are the proteins around which the DNA strands are wrapped. Histone modification helps direct the changes in DNA cytosine methylation that have been discussed, and plays a pivotal role in gene
4. The nucleic acid methylation changes described herein including DNA methylation changes include the various forms of methylation changes including cytosine methylation, cytosine hydroxymethylation, and other forms of cytosine epigenetic modification. Adenine nucleotide can also undergo DNA methylation changes. Therefore, the methods described herein include the use of cfF DNA to measure cytosine modification in all its forms that would fall within the definition of adenine methylation used for the purpose of predicting or monitoring CHD using cfF DNA.
6. The use of DNA epigenetic changes as described herein (including methylation changes in all its forms in cytosine or adenine and including histone epigenetic modification in all its forms) based on cfF DNA for the detection of other fetal congenital anomalies in which DNA methylation or histone modification plays a role in development, for example, neural tube defects and cleft lip and palate.
7. The methods described herein include the use of cfF DNA for epigenetic monitoring of the effect of exposures on a pregnancy with respect to the risk of development of relevant cardiac anomalies described herein, such as alcohol, medications, chemicals or other known risk exposures, or disorders such as maternal diabetes or hypertension in an ongoing pregnancy.
8. A method of predicting or diagnosing congenital heart defect (CHD) in a subject in need thereof, wherein the method includes assaying a biological sample, obtained from the subject, including cf nucleic acids to determine frequency or percentage of cytosine methylation at one or more loci throughout the genome; and comparing the cytosine methylation level of the sample to cytosine methylation of a control sample.
9. The method of any one of embodiments 1-8, wherein the method further includes using artificial intelligence (AI) techniques.
10. The method of any one of embodiments 1-9, wherein the method further includes using AI techniques comprising one or more of the following machine learning algorithms: Random Forest (RF), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Prediction Analysis for Microarrays (PAM), Generalized Linear Model (GLM), or deep learning (DL).
11. The method of any one of embodiments 1-10, wherein the method further includes calculating the subject's risk of developing CHD.
12. The method of any one of embodiments 1-11, wherein the control sample includes one or more biological samples from one or more normal (healthy) patients or from one or more patients diagnosed with CHD.
13. The method of any one of embodiments 1-12, wherein the biological sample includes body fluid.
14. The method of any one of embodiments 1-13, wherein the biological sample includes blood, plasma, serum, urine, saliva, sputum, sweat, breath condensate, tears, genital secretion including cervical secretion, amniotic fluid, placental tissue, CVS specimen, and umbilical cord blood obtained at birth.
15. The method of any one of embodiments 1-14, wherein the cf nucleic acids include cfF nucleic acids.
16. The method of any one of embodiments 1-15, wherein the biological sample includes cfF nucleic acids from first trimester, second trimester, and/or third trimester of pregnancy.
17. The method of any one of embodiments 1-16, wherein the cf nucleic acids include DNA.
18. The method of any one of embodiments 1-17, wherein the one or more loci include one or more loci from Tables 1-6.
19. The method of any one of embodiments 1-18, wherein the one or more loci include at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least 10 loci from Tables 1-6.
20. The method of any one of embodiments 1-19, wherein the one or more loci have an AUC (with 95% CI) of greater than 0.70, 0.75, 0.80, 0.85, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, or 0.97.
21. The method of any one of embodiments 8-20, wherein the assay is a bisulfite-based methylation assay or a whole-genome methylation assay.
22. The method of any one of embodiments 1-21, wherein the one or more loci include cg06301252, cg02807450, or cg12900404.
23. The method of any one of embodiments 1-22, wherein the one or more loci include cg04761177, cg21431091, cg01263077, cg09853933, cg27142059, cg16551159, cg14957943, cg06978680, or cg12592721.
24. The method of any one of embodiments 1-23, wherein the method further includes treating the CHD.
25. The method of any one of embodiments 1-24, wherein the method further includes treating the subject by administering medication and/or performing surgery on the subject.
26. The method of any one of embodiments 1-23, wherein the method prevents CHD in future pregnancy.
27. The method of embodiment 26, wherein the method includes supplementing the pregnant mother with folic acid or folates.
In this study, a total of 12 cases of CHD were analyzed. The 12 cases included the following: 1 case of pulmonary artery atresia with pulmonary valve stenosis; 4 cases of ventricular septal defect (VSD); 1 case of truncus arteriosus; 2 cases of tetralogy of Fallot; 1 case of pulmonary artery valve stenosis; 1 case of ventricular septal defect with atrial septal defect; 1 case of double aortic arch; and 1 case of bicuspid A-V valve with dilated main pulmonary artery. The demographics are provided in Supplemental Table S1. Table 1 provides the epigenomic data. Tables 2-6 provide epigenomic data obtained in combination with artificial intelligence techniques.
Introduction. In a series of studies (Radhakrishna et al. 2018; Bahado-Singh et al. 2019a; Radhakrishna et al. 2019), methylation changes demonstrated in placental DNA identified genes and gene networks that were epigenetically dysregulated in isolated VSD and non-syndromic tetralogy of Fallot, two of the most important categories of CHD. The studies helped shed important light on the pathogenic mechanisms of these CHDs. Further, CHD was accurately screened using DNA methylation markers from the placenta. While the placenta, given its abundance and limited clinical value, is ideal for such studies, applying the results to the prenatal prediction of CHD represents a significant challenge. Obtaining placental trophoblast tissue useful for DNA analysis generally requires invasive procedures such as chorionic villus sampling or placental biopsy (Alfirevic et al. 2003). The procedure is painful, requires specialist expertise, and is potentially associated with increased risk for pregnancy complications. These prior studies did not, however, address the issue of whether cf DNA released from the placenta into the circulation could be used to detect CHD in the developing fetus or embryo.
In the last two decades, genomic analysis of cfF DNA present in the maternal circulation in pregnancy has progressed to wide clinical utilization for the detection of fetal aneuploidies (Goldwaser & Klugman 2018) and other chromosomal (Ke et al. 2015; Grace et al. 2016) and molecular pathologies (Stewart et al. 2018). “Cell-free fetal DNA,” however, is an inexact term, as the DNA is actually from the placenta itself, an embryological fetal tissue. There is constant proliferation, differentiation, and apoptosis of the placental trophoblast (Taglauer et al. 2014). Placental apoptotic material is continuously shed into the maternal circulation. This trophoblast apoptotic material constitutes the cell-free “fetal” DNA found in the maternal blood (Gupta et al. 2004; Tjoa et al. 2006) and reported in clinical studies (Wataganara et al. 2005; Alberry et al. 2007). Cell-free “fetal” DNA constitutes a significant percentage of the cf DNA in the maternal circulation. The contribution of the fetal fraction to overall maternal cf DNA blood levels increases with advancing gestational age (Rafaeli-Yehudai et al. 2018). In the following study, cf DNA was extracted from the plasma (Zolotukhina et al. 2005) of pregnant women. DNA methylation analysis was performed on the cf DNA. DNA methylation analysis based on cf DNA was used to help elucidate the epigenetic pathogenesis of CHD development and also to predict non-syndromic CHD based on the DNA methylation changes observed in the circulating cfF DNA in the maternal circulation of mid-trimester pregnancies.
Methods: Study samples and cf DNA extraction. The human ethical committee approved the present study and each subject of the study provided informed written consent. A total of 12 cases and 26 controls were analyzed. The subjects were the mothers who gave birth to CHD cases and normal babies. The mean age was 29.7 years for cases and 31.3 years for controls. cfF DNA was extracted from maternal blood. Blood samples were obtained from the subjects during pregnancy. The blood samples were drawn directly into Streck Cell-Free DNA BCT® tubes to ensure good quality of cf DNA from the plasma (Bartak et al. 2019). The tubes were then centrifuged at 3000×g for 15 minutes within 24 hours of the blood draw. The plasma was aliquoted into 2.0 ml micro-centrifuge collection tubes without disturbing the buffy coat of the sample. The aliquoted plasma samples were stored at −80° C. until further processing (Sheinerman et al. 2017). Five ml of plasma was used to extract cf DNA using the QIAamp circulating nucleic acid kit (Qiagen Cat #55114). The process used was a manual vacuum process using the QIAvac 24 Plus vacuum manifold following the manufacturer's protocol. The silica membrane technology, which can bind fragmented DNA, enabled the efficient recovery of DNA. The method preserved the methylation status of the DNA, allowing it to be used for subsequent bisulfite conversion and methylation profiling.
Bisulfite conversion and methylation array processing. The bisulfite conversion of DNA was performed using the EZ DNA Methylation Gold Kit (Zymo, USA) according to the company's protocol using 10 µl of elution buffer (Hardy et al. 2017). The methylation profiling was performed using Illumina Infinium Methylation EPIC BeadChip arrays with over 850K methylation markers according to the manufacturer's instructions. Samples were randomized on the chips. Case and control samples were processed together to avoid any sample processing bias. The arrays were dried using a vacuum drier and the processed BeadChips were imaged using the Illumina iScan System as per the company's instructions.
Statistical and bioinformatic analysis. After scanning, the raw IDAT files were downloaded from the iScan software and the data were analysed using GenomeStudio software. The β-values of cases versus control subjects were compared for individual CpG loci to perform differential methylation analysis. Previous publications described further downstream statistical analysis (Bahado-Singh et al. 2019a; Bahado-Singh et al. 2019b). For the downstream analysis, CpG probes on the X and Y chromosomes and probes with dbSNP entries within 10 bp of the CpG site were excluded to avoid gender bias and genetic effects on methylation sites, respectively (Wilhelm-Benartzi et al. 2013). A Benjamini-Hochberg (BH) adjusted FDR p-value <0.01 was used as the cutoff for significance. The area under the receiver operating characteristic curve (AUC-ROC) with 95% CI was calculated for each significant CpG locus using the dplyr, reshape2 and ROCR packages in R.
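The two statistics named in the preceding paragraph, the BH-adjusted p-value and the per-locus AUC, can be illustrated with a minimal Python sketch. This is not the R code used in the study; the function names `bh_adjust`, `auc` and `direction_free_auc` are hypothetical, and the AUC is computed by the standard pairwise (Mann-Whitney) definition rather than with ROCR.

```python
from typing import List, Sequence


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def auc(case_values: Sequence[float], control_values: Sequence[float]) -> float:
    """AUC as the probability that a case beta-value exceeds a control one.

    Ties count as one half (Mann-Whitney definition).
    """
    wins = 0.0
    for c in case_values:
        for k in control_values:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


def direction_free_auc(case_values: Sequence[float],
                       control_values: Sequence[float]) -> float:
    """AUC that treats hypo- and hypermethylated loci alike (assumption)."""
    a = auc(case_values, control_values)
    return max(a, 1.0 - a)
```

A locus would then be retained, under the thresholds stated above, when its adjusted p-value is below 0.01; the 95% CI step is omitted from this sketch.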
Artificial Intelligence (AI) Analysis. The methods used for the AI analyses performed are described in Bahado-Singh et al. 2019b. What follows is a summary of the previously published descriptions. The epigenomic data were divided into two groups, a training group consisting of 80% of the study subjects and a test group that constituted the remaining 20%. It is an approach that is frequently used when analyzing smaller data sets. We performed 10-fold cross-validation on the data from the training group in order to generate the prediction model. This model was then appraised in the independent validation or test group. Six appropriate AI approaches or algorithms were used to determine screening performance for CHD detection. These were Random Forest (RF), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Prediction Analysis for Microarrays (PAM), Generalized Linear Model (GLM), and Deep Learning (DL), which is the newest form of AI and is increasingly being used in the analysis of complex and voluminous biological data such as epigenomics.
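The 80%/20% split followed by 10-fold cross-validation on the training group can be sketched as below. This is an illustration only: the source does not state whether the split was stratified by case status, so stratification, the seed, and the helper names `stratified_split` and `k_folds` are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split sample indices into train/test, keeping class proportions."""
    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def k_folds(indices: Sequence[int], k: int = 10,
            seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (fit, validate) index pairs covering `indices` exactly once."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    all_idx = set(shuffled)
    return [(sorted(all_idx - set(f)), sorted(f)) for f in folds]


# 12 cases (label 1) and 26 controls (label 0), as in the study.
labels = [1] * 12 + [0] * 26
train_idx, test_idx = stratified_split(labels)
cv_pairs = k_folds(train_idx, k=10)
```

Each model would be fitted on the `fit` indices of every fold and scored on the matching `validate` indices, with `test_idx` held back until the end. With only 38 subjects the held-out group is very small, which is why the cross-validation step carries most of the model selection.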
RF is a supervised algorithm for classification, regression, and other functions. A forest of decision trees is randomly created and the mean prediction of the individual trees is determined. Increasing the number of trees in the forest generally improves the accuracy of the results, although the gain levels off as more trees are added. RF has the benefits of being able to work with missing values in a data set and can utilize categorical values (Huang et al. 2013). SVM is first fed with labelled data (supervised learning) identifying the different groups and from this builds a model for distinguishing the groups. Subsequently, when provided with fresh unlabelled data, it uses the learned hyperplanes to separate one group from another. SVM can perform both regression and classification tasks and can handle multiple continuous and categorical variables (Mahadevan et al. 2008). LDA was used to reduce the number of features or predictors needed to accurately classify and discriminate the groups. This is particularly useful for epigenomic analysis as the study started out with close to 900,000 potential features to be used for CHD detection. LDA is simple in its approach but can still achieve excellent accuracy, in some cases as good as that of more complex methods. LDA is based on the identification of a linear combination of variables (predictors) that best separates the two classes (targets) (Liland 2011). It is closely related to the analysis of variance (ANOVA) and regression analysis, which attempt to define an outcome variable based on a combination of explanatory variables. PAM is a statistical technique for class prediction from gene expression data using the nearest shrunken centroids (Alakwaa et al. 2018; Candel et al. 2018). This method identifies the subsets of genes that best characterize each class.
GLMs are a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models, and others (Alakwaa et al. 2018; Candel et al. 2018). DL is a form of representation learning that uses multiple transformation steps to create very complex features. The DL approach used here is a feed-forward artificial neural network (ANN), which uses more than one hidden layer (y) to connect the input layer (x) and the output layer (z) via a weight matrix. The weight matrix is fitted to minimize the difference between the network output and the known class labels, and DL is considered by some to be the best AI approach (Alakwaa et al. 2018; Candel et al. 2018).
Multivariate Regression Analysis provided with model. Before the multivariate regression analysis was performed, the epigenomics data were subjected to quantile normalization and auto-scaling. As a quality control step, to investigate the existence of any systematic variation and to detect potential outliers, principal component analysis (PCA) was performed on all classes using MetaboAnalyst (v4.0) (Chong et al. 2018). Subsequently, these pre-processed data were used to perform partial least-squares discriminant analysis (PLSDA). The number of CpG β-value variables was chosen based on predictive accuracy and cross-validation using the leave-one-out cross-validation method available in MetaboAnalyst. Goodness of fit (R2) and predictability (Q2) values for each PLSDA model have been reported. On cross-validation, a 2000-iteration permutation test was performed to assess whether the observed separation on PLSDA could be due to chance, with a p-value <0.05 considered significant.
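The two pre-processing steps named above can be sketched in plain Python. This is not the MetaboAnalyst implementation: it assumes quantile normalization is applied across samples (each sample is forced onto the same reference distribution), that ties are broken by position, and that auto-scaling means mean-centering each CpG and dividing by its sample standard deviation. The function names are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def quantile_normalize(samples: Sequence[Sequence[float]]) -> List[List[float]]:
    """Give every sample the same distribution of values.

    `samples` holds one list per subject, one beta-value per CpG.
    """
    n_features = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [mean(s[k] for s in sorted_samples) for k in range(n_features)]
    normalized = []
    for s in samples:
        order = sorted(range(n_features), key=lambda i: s[i])
        out = [0.0] * n_features
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # value at this rank in the reference
        normalized.append(out)
    return normalized


def autoscale(samples: Sequence[Sequence[float]]) -> List[List[float]]:
    """Mean-center each CpG and divide by its sample standard deviation."""
    n_features = len(samples[0])
    columns = [[s[j] for s in samples] for j in range(n_features)]
    mus = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]
    return [
        [(s[j] - mus[j]) / sds[j] if sds[j] > 0 else 0.0
         for j in range(n_features)]
        for s in samples
    ]
```

After these steps every CpG has mean zero and, where it varies at all, unit variance, so that no single high-variance locus dominates the PCA or PLSDA projection.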
Cluster analysis of differentially methylated targets in cfF DNA CHD. A heatmap was generated using individual β-values for the significantly differentially methylated markers between cases and controls.
Gene ontology analysis and functional enrichment. The genes found to be significantly differentially methylated with FDR p-value <0.01 were used to perform disease and functional enrichment analysis using Ingenuity Pathway Analysis (IPA) (Qiagen IPA) system. The IPA platform enables systemic analysis of array data associated with the biological function (Haddad et al. 2016). The gene networks were considered based on their inter-relationship and role in cardiac development and diseases.
Results. Genome-wide methylation profiling was performed using cfF DNA extracted from the mothers who gave birth to 12 CHD cases and 26 controls. The mean gestational age at blood draw was 23 weeks 5 days for cases and 24 weeks 6 days for controls. Neither maternal age (p-value=0.32) nor gestational age at the time of sampling (p-value=0.15) differed significantly between mothers who gave birth to CHD babies and mothers who gave birth to normal babies. The details of the demographics are provided in Supplemental Table S1.
Differential methylation analysis of cfF DNA identified a total of 5918 CpG markers encompassing 4976 genes (FDR p-value ≤0.01). The top 1000 significant CpG loci and associated genes, with University of California Santa Cruz (UCSC) gene symbols and the statistics, are provided in Table 1. Of these markers, 130 CpGs were hypomethylated and 870 were hypermethylated in association with CHD. 53 hypomethylated and 486 hypermethylated markers were differentially methylated by ≥10%. The three markers with an absolute methylation difference above 20% were cg06301252 (PTPRN2) with 33.62%, cg02807450 (MTMR2) with −21.15%, and cg12900404 (DOCK10) with −20.13%. 126 CpG loci individually showed AUC ≥0.80, indicating excellent predictive accuracy for the disease.
AI prediction of methylation markers for cfF DNA CHD. The significantly differentially methylated CpGs with a 5% methylation difference and AUC >0.70 were used to perform AI analysis using 6 predictive algorithms. Among them, the RF and SVM models showed the highest performance, with AUC (95% CI)=0.98 (0.831−1), 93.8% sensitivity and 93.2% specificity for RF, and AUC (95% CI)=0.97 (0.877−1), 98% sensitivity and 94% specificity for SVM, followed by the DL model with AUC (95% CI)=0.94 (0.840−1.0), 93% sensitivity and 94% specificity (Table 2). High performance was also achieved for each of the other platforms. Comparable predictive performance was achieved when demographic markers were considered along with CpG loci (Table 3), and high performance was also achieved when only markers meeting stringent GWAS thresholds were considered (Table 4). High performance was achieved when demographic markers and CpG loci meeting stringent GWAS thresholds were considered (Table 5). Table 6 shows that high predictive accuracies are seen when only markers showing a high level of methylation change, for example 1.5 fold or greater, are used. The top 5 predictive markers for each model are provided in descending order under each table.
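For reference, the sensitivity and specificity figures quoted above follow the usual confusion-matrix definitions. The sketch below shows the arithmetic on made-up scores; the 0.5 decision threshold, the example data, and the function names are illustrative assumptions and are not taken from the study.

```python
from typing import List, Sequence, Tuple


def classify(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn predicted probabilities into labels (1 = CHD, 0 = control)."""
    return [1 if s >= threshold else 0 for s in scores]


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for labels coded 1 = CHD, 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical held-out subjects: 3 cases followed by 4 controls.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.6]
sens, spec = sensitivity_specificity(y_true, classify(scores))
```

In this toy example two of three cases and three of four controls are classified correctly, giving a sensitivity of 2/3 and a specificity of 3/4.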
Logistic Regression analysis. Conventional logistic regression analysis with 10-fold cross validation was performed, and the resulting markers were compared with the AI prediction for the CpGs with a 5% methylation difference and AUC >0.70. On the training set, an AUC (95% CI) of 0.98 (0.98−0.97) with 100% sensitivity and 85% specificity was obtained. The logistic regression model with 10-fold cross validation provided an AUC (95% CI) of 0.79 (0.61−0.98) with 83% sensitivity and 80% specificity on the test set. The logistic regression equation was as follows:
logit(P)=log (P/(1−P))=−5.469−2.698 cg08230215−3.924 cg04761177−2.478 cg10259004−2.477 cg06009031.
For the CpG markers meeting the stringent p-value threshold of 5×10⁻⁸, an AUC (95% CI)=0.99 (0.92−0.99) with 90% sensitivity and 95% specificity was found on the training/discovery set. The 10-fold cross validation showed an AUC (95% CI)=0.71 (0.50−0.92) with 75% sensitivity and 87% specificity, with the equation:
logit(P)=log(P/(1−P))=−2.583−5.115cg04761177−6.351cg23306063+1.126cg10259004−0.369cg23136742−2.917cg11196182.
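To show how such an equation is applied, the sketch below converts a logit into a predicted probability P using the coefficients printed in the two equations above. The marker values passed in are hypothetical, and the scale on which the model expects them (raw β-values or normalized, auto-scaled values) is not stated here, so the inputs are placeholders only.

```python
import math
from typing import Dict

# Coefficients copied from the two logistic regression equations above.
MODEL_5PCT = {
    "intercept": -5.469,
    "cg08230215": -2.698,
    "cg04761177": -3.924,
    "cg10259004": -2.478,
    "cg06009031": -2.477,
}
MODEL_STRINGENT = {
    "intercept": -2.583,
    "cg04761177": -5.115,
    "cg23306063": -6.351,
    "cg10259004": 1.126,
    "cg23136742": -0.369,
    "cg11196182": -2.917,
}


def predicted_probability(model: Dict[str, float],
                          values: Dict[str, float]) -> float:
    """Invert logit(P) = log(P / (1 - P)) to obtain P."""
    logit = model["intercept"]
    for marker, coefficient in model.items():
        if marker != "intercept":
            logit += coefficient * values[marker]
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical input: every marker in the first model set to -1.0.
example = {m: -1.0 for m in MODEL_5PCT if m != "intercept"}
p_example = predicted_probability(MODEL_5PCT, example)
```

Because every coefficient in the first equation is negative, lower marker values push the logit upward and therefore raise the predicted probability of CHD.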
Cluster analysis of differentially methylated targets in cfF DNA CHD. In hierarchical cluster analysis using individual β-values for significantly differentially methylated markers between cases and controls, well separated clusters for hypo and hypermethylation among cases and controls were observed.
Disease and functional enrichment analysis. To determine whether the statistical findings had biological plausibility, pathway analyses were performed. Ingenuity Pathway systems showed significant disease and functional enrichment of the genes associated with CHD. The top 3 cardiac development and disease functions showing enrichment were Congenital Heart Disease (p=6.69E-03), Cardiac Hypertrophy (p=4.34E-03) and Cardiogenesis (p=3.73E-05).
Discussion. Comprehensive methylation profiling combined with AI prediction was performed using the circulating cfF DNA from the mother's blood. The study indicated significant differential methylation of 5918 CpG markers comprising 4976 genes between cases and controls. cf DNA is a promising biomarker source for prenatal diagnosis that is minimally invasive and is gradually replacing invasive tests such as amniocentesis or chorionic villus sample based tests (Nagy 2019). Epigenetic factors regulate gene expression and affect heart development during embryogenesis (Vallaster et al. 2012). Genes at the 1278 hypomethylated loci tend to be overexpressed in the cells (Klasic et al. 2016), while genes at the remaining 4640 hypermethylated loci may show downregulated expression in the cells (Razin & Kantor 2005). 539 CpGs showed a methylation difference ≥10%, indicating biological relevance for gene expression. The study samples were found to be separated based on the PLS-DA analysis, and the differences between hyper and hypomethylated markers are shown in the heatmap.
AI analysis was performed using 6 different algorithms including SVM, GLM, PAM, RF, LDA and DL. The SVM model provided the best prediction with an AUC=0.97 (98% sensitivity and 94% specificity). The top 5 predictive markers include cg04761177 (ATP2A1), cg21431091 (TMEM9), cg01263077 (MYO9B), cg09853933 (ATG2B; GSKIP) and cg27142059 (TRIM15). For comparison with the AI prediction, the markers were tested using conventional logistic regression analysis with 10-fold cross validation, which identified cg08230215 (MAST3), cg04761177 (ATP2A1), cg10259004 (MYL9) and cg06009031 (C7orf50). The marker common to the predictive algorithms SVM, PAM, RF, LDA, DL and logistic regression was cg04761177 (ATP2A1). cg04761177 was hypomethylated with a 9% methylation difference and had an AUC=0.94 (CI 0.86-1.00) for CHD prediction. The gene ATP2A1 is also termed SERCA1. This gene regulates the electrical and contractile properties of the heart, and its dysregulation is associated with diverse heart diseases including congestive heart failure (Peters et al. 1997; Ennis et al. 2002).
The second common CpG marker predicted to be an excellent marker by both the AI algorithms with the stringent p-value threshold of 5×10⁻⁸ and conventional logistic regression was cg10259004 (MYL9). The MYL9 gene codes for Myosin Light Chain, is expressed in cardiac smooth muscle, and participates in the morphogenesis of the heart (England & Loughna 2013).
Disease and functional enrichment of genes with CpGs in cf DNA associated with CHD. The disease-based pathway enrichment analysis identified significantly differentially methylated genes that are currently known or predicted to be associated with cardiac hypertrophy, cardiogenesis and congenital heart disease. This indicates the biological plausibility of the identified genes in association with CHD. All identified CpG methylation sites were located within the promoter or gene region and were associated with mechanisms of cardiac hypertrophy, cardiogenesis and congenital heart disease with a significant p-value <0.001. For example, the hypomethylated genes include HSPB11, POFUT1, NFATC4, DTNBP1, CFLAR, KCNH2, B3GAT3 and BMP4, and the hypermethylated genes include BMP7, NOG, MAP2K2, FGF9, ADAM17 and MAPK3.
HSPB11 was found to be associated with cardiogenesis and is highly expressed in later stages of ventricular tissue development in zebrafish (Singh et al. 2016). POFUT1 is an essential component of Notch signalling, which plays a vital role in the development of the heart valves, cardiac outflow tracts and ventricular septum formation (Penton et al. 2012). Mice deficient in POFUT1 die during mid-gestation with severe defects in vasculogenesis and cardiogenesis (Shi & Stanley 2003). NFATC4 and DTNBP1 are involved in inducing human cardiac hypertrophy (Poirier et al. 2003; Rangrez et al. 2013). Reduced expression of the CFLAR gene has been shown to result in severe defects in cardiac trabecular formation and cardiac ventricular structures, thinner myocardium, and cardiac lethality (Imanishi et al. 2000; Yeh et al. 2000; Lakhani et al. 2006; Ye et al. 2013). KCNH2 codes for a protein known as Kv11.1 (a potassium ion channel), which conducts potassium ions out of the cardiac myocytes (Newton-Cheh et al. 2007; Park et al. 2013). A retrospective study in a large cohort found that mutations in KCNH2 result in atrioventricular block, tetralogy of Fallot, coarctation, atrial septal defect, atrioventricular canal, bicuspid aortic valve, patent ductus arteriosus, tricuspid atresia and ventricular septal defect (Ebrahim et al. 2017). Another gene, B3GAT3, plays a significant role in proteoglycan biosynthesis (von Oettingen et al. 2014), and mutations affecting these proteoglycan chains result in severe congenital heart anomalies including bicuspid aortic valve, ventricular septal defects, and mitral valve prolapse (Baasanjav et al. 2011; Bloor et al. 2017). BMP4 is essential for the development of the endocardial cushion and for normal partitioning of the outflow tract. In mice, loss of BMP4 function results in ventricular septal defects, abnormal semilunar valve formation, abnormal cushion remodelling, persistent truncus arteriosus and inadequate cardiac differentiation in the developing epicardial cushion (McCulley et al. 2008).
The hypermethylated genes include BMP7, which encodes a secreted ligand of the TGF-beta superfamily that plays a key regulatory role in coronary vasculature, ventricular myocardial development and compaction (Azhar et al. 2003). In mouse embryo, low BMP7 is linked to increased cardiovascular disease morbidity and mortality (Silverman et al. 2004; Freedman et al. 2009). Another hypermethylated gene, NOG (Noggin), interacts strongly with BMP2 and BMP4 and also interacts with BMP7 (Choi et al. 2007). It is notable that both BMP4 and BMP7 were differentially methylated in the present study along with NOG. Noggin knockout mice show several anomalies of cardiovascular development that result from BMP signaling, and an analogous mechanism is predicted in humans (Choi et al. 2007). MAP2K2 (Mitogen-Activated Protein Kinase Kinase 2), another hypermethylated gene, was earlier reported to be associated with increased prevalence of cardiac hypertrophy through activation of the ERK pathway (Gillespie-Brown et al. 1995; Gallo et al. 2019). The other MAP kinase family gene, MAPK3, is involved in the regulation of meiosis, mitosis and post-mitotic function, and dysregulation of this gene contributes to cardiac hypertrophy (Bueno et al. 2000; Mutlak & Kehat 2015). The FGF9 gene is largely expressed in the epicardium, where it maintains myocardial proliferation during mid-gestational cardiac development. Altered expression of FGF9 is associated with significantly decreased cardiomyoblast proliferation and ventricular hypoplasia (Lavine et al. 2005). FGF9 also functions as a paracrine signal in embryonic heart development, and its loss of function is associated with decreased cardiomyocyte proliferation (Itoh et al. 2016). The A Disintegrin And Metalloproteinase Domain-Containing Protein 17 gene (ADAM17) plays a significant role in structural cardiac remodeling by changing cell-surface matrix receptors, and loss of function of ADAM17 contributes to cardiac hypertrophy (Wang et al. 2009; Takayanagi et al. 2016). Altered expression of ADAM17 has also been associated with ventricular remodelling, suggesting an important role in late stages of cardiac remodelling (Zheng et al. 2016).
This proof-of-concept study confirms the effect of DNA methylation in causing CHD and shows that identifying this effect using cfF DNA is a possible approach. The diagnostic accuracies predicted using various statistical methods and the AI platform helped to prioritize genes that are novel for CHD prediction.
In conclusion, significantly differentially methylated markers associated with CHD were identified in cfF DNA. The AI algorithms predicted significant markers associated with CHD, and the disease enrichment analysis showed important genes associated with cardiac development and function. Developing minimally invasive methods for prenatal diagnosis is of clinical importance. Understanding the role and regulation of the genes identified using cfF DNA in this study would further form the basis for understanding the molecular mechanisms of the embryological development of the normal and abnormal heart from the earliest stages of pregnancy.
The subject matter described above is provided by way of illustration only and should not be construed as limiting. Various modifications and changes may be made to the subject matter described herein without following the example embodiments and applications illustrated and described, and without departing from the true spirit and scope of the present disclosure, which is set forth in the following claims.
All publications, patents and patent applications cited in this specification are incorporated herein by reference in their entireties as if each individual publication, patent or patent application were specifically and individually indicated to be incorporated by reference. While the foregoing has been described in terms of various embodiments, the skilled artisan will appreciate that various modifications, substitutions, omissions, and changes may be made without departing from the spirit thereof.
This application claims the benefit of U.S. Provisional Patent Application 62/941,357, filed on Nov. 27, 2019, which is hereby incorporated by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2020/062194 | 11/25/2020 | WO | |

Number | Date | Country |
---|---|---|
62941357 | Nov 2019 | US |