Aspects of the present invention relate generally to methods for measuring genomic DNA methylation loss, and more particularly to methods enabling measurement of genomic DNA methylation loss that is linked to cellular replicative/mitotic history. Additional aspects relate to methods for measuring mitotic turnover rate, chronological age of a cell or tissue, excessive replicative turnover, increased risk for conditions associated with excessive replicative turnover or aging, identification of subjects for increased surveillance, cancer screening, forensic analysis, etc.
This application claims priority to U.S. Provisional Application 62/637,979 filed on Mar. 2, 2018, the disclosure of which is considered part of the disclosure of this application and is hereby incorporated by reference in its entirety.
The contents of the text file named “2019_03_01_SequenceListing ST25.txt” which was created on Mar. 1, 2019, and is 74.8 KB in size, are hereby incorporated by reference in their entirety.
Loss of 5-methylcytosine in both benign and malignant neoplasms was discovered more than thirty years ago (1-4), yet the mechanisms that lead to this hypomethylation and its role in disease remain poorly understood. Genomic studies (5-9) established that hypomethylation occurs in only about half the genome, coinciding with megabase-scale domains of repressive chromatin characterized by low gene density, low GC-density, late replication timing, localization at the nuclear lamina, and Hi-C “B” domains (10,11). These regions were termed “Partially Methylated Domains” (PMDs), and were contrasted with “Highly Methylated Domains” (HMDs) that make up the remainder of the genome (12). PMDs have been confirmed as a common feature of most epithelial cancers (13), and other cancer types such as pediatric medulloblastoma (14).
Conflicting evidence suggests that PMD hypomethylation could provide tumors with a growth advantage or alternatively may represent only a side effect of cancer (15, 16). An understanding of the earliest origins of this process could help elucidate a potential role of PMD hypomethylation in cancer initiation, yet results in pre-cancer cell types have been conflicting. Since the 1980s, long-term cell culture has been known to result in significant DNA hypomethylation (17), which was later discovered to occur primarily in PMD domains (8, 12, 18, 19) and to accumulate stochastically in culture (20, 21). In primary uncultured tissues, one study showed the existence of PMDs in a few highly proliferative tissues such as peripheral white blood cells and placenta, but not in slowly dividing tissues like kidney, lung, or brain (9). Other studies have shown the presence of global hypomethylation in placenta (22) and more differentiated B cells (23) and T cells (24), but not in early stage B cells or T cells nor in myelocytes (23, 24). The largest whole-genome bisulfite sequencing (WGBS) study of normal tissues concluded that PMDs were undetectable in 17 of 19 human tissue types studied (34 of 37 total samples), with the only exceptions being placenta and pancreas (25). This reinforced the prevailing view that PMD hypomethylation may be restricted to a very limited set of normal cell types, or only initiated upon exposure to environmental factors such as carcinogens (26). Applicants and one other group detected a small degree of PMD hypomethylation in normal mucosa adjacent to colon tumors (5, 6), but could not rule out a pre-cancer “field effect” in these adjacent tissues.
There is a need to investigate the dynamics of hypomethylation across a large number of normal and malignant tissues, and to develop new methods to enable determination of whether there are PMDs shared by normal mammalian cells and cancer cells, to enable further definition of possible relationships between PMDs, other chromatin features, and genomic mutational processes.
Particular aspects provide the largest and most diverse set of WGBS experiments to date, including new tumor and adjacent normal data from 8 common cancer types. By identifying a local sequence signature that defined the most strongly hypomethylated CpGs within PMDs, we were able to determine that most PMDs are shared by cancers and nearly all healthy human and mouse tissue types starting from fetal development. This allowed, for the first time, investigation of the dynamics of hypomethylation across a large number of normal and malignant tissues, and definition of the relationship between PMDs, other chromatin features, and genomic mutational processes.
In certain aspects, the present methods can be used to derive mitotic age for each tissue type separately, and derive a mapping for the corresponding tissue type/cell type. Such tissue/cell-type variation can be well controlled and exploited in cell-sorting based methods.
As disclosed and described herein, a set of 39 diverse primary tumors and 8 matched adjacent tissues was profiled using Whole-Genome Bisulfite Sequencing (WGBS) and analyzed alongside 343 additional human and 206 mouse WGBS datasets. A local CpG sequence context associated with preferential hypomethylation in PMDs was identified. Surprisingly, analysis of CpGs in this context (“Solo-WCGWs”, disclosed herein) revealed previously undetected PMD hypomethylation in almost all healthy tissue types. PMD hypomethylation increased with age, beginning during fetal development, and appeared to track the accumulation of cell divisions. In cancer, PMD hypomethylation depth correlated with somatic mutation density and cell-cycle gene expression, consistent with its reflection of mitotic history, and suggesting its application as a mitotic clock.
According to particular aspects of the present invention, therefore, late replication leads to lifelong progressive methylation loss, which acts as a biomarker for cellular aging and which, according to additional aspects, contributes to oncogenesis.
Particular surprisingly effective aspects provide a method comprising: a) identifying a test cell or tissue sample for which a determination of replication-associated DNA methylation loss is desired; b) obtaining, at data processing apparatus, CpG dinucleotide sequence methylation data for genomic DNA derived from the test cell or test tissue sample, wherein the genomic DNA comprises highly methylated domains (HMD) and partially methylated domains (PMD), wherein each such CpG dinucleotide is the sole CpG dinucleotide sequence within a n(x)WCpGWn(x) genomic DNA sequence motif (Solo-WCGW motif) of at least one PMD, and wherein W=A or T, n=A or G or C or T, and x≥9; c) determining, at the data processing apparatus, based on the CpG dinucleotide sequence methylation data, a mean or average CpG dinucleotide methylation value, or a value related thereto, for a plurality of Solo-WCGW motif sequences of the at least one PMDs, to provide a measure of cellular replication-associated DNA methylation loss (e.g., compared to HMD), wherein the provided measure of replication-associated DNA methylation loss reflects a cumulative number of cell divisions or mitotic history; and d) based on the provided measure of replication-associated DNA methylation loss, reaching a conclusion, at the data processing apparatus, as to a condition or state of the test cell or tissue sample. In the methods, obtaining the genomic CpG dinucleotide sequence methylation data may comprise excluding at the data processing apparatus, from a larger set of genomic CpG methylation data, methylation data of CpG dinucleotide sequences not within the Solo-WCGW motif sequences of the at least one PMD. In the methods, obtaining the genomic CpG dinucleotide sequence methylation data may comprise excluding, at the data processing apparatus, from a larger set of genomic CpG methylation data, methylation data of non-intergenic Solo-WCGW motif sequences of the at least one PMD. 
In the methods, obtaining the genomic CpG dinucleotide sequence methylation data may comprise excluding, at the data processing apparatus, from a larger set of genomic CpG methylation data, methylation data of H3K36me3 histone marked Solo-WCGW motif sequences of the at least one PMDs. In the methods, obtaining the genomic CpG dinucleotide sequence methylation data may comprise excluding cell type invariant proxies for H3K36me3 histone marked Solo-WCGW motif sequences, such as those falling in transcribed gene bodies. In the methods, the plurality of Solo-WCGW motif sequences of the at least one PMDs may be located at one or more PMDs of a single chromosome. In the methods, the plurality of Solo-WCGW motif sequences of the at least one PMDs may be located between or among multiple chromosomes. In the methods, x may be a value selected from the group consisting of at least 9, at least 14, at least 19, at least 24, at least 29, at least 34, at least 39, at least 44, at least 49, at least 54, at least 59. In the methods, x may be a value in a range selected from the group consisting of about 9-49, 9-99, 9-149, 9-199, 14-49, 14-99, 14-149, 14-199, 19-49, 19-99, 19-149, 19-199, 24-49, 24-99, 24-149, 24-199, 29-49, 29-99, 29-149, 29-199, 34-49, 34-99, 34-149, 34-199, 39-49, 39-99, 39-149, 39-199, 44-49, 44-99, 44-149, 44-199, 49-99, 49-149, 49-199, 54-99, 54-149, 54-199, 59-99, 59-149, 59-199, and any subranges of the preceding ranges. In the methods, x may be 34±25 (e.g., in the range of 9-59). In the methods, x may be 34±15 (e.g., in the range of 19-49). In the methods, x may be 34 or about 34. In the methods, the Solo-WCGW motif may comprise the sequence n(x−1)mWCpGWGn(x−1), and wherein W=A or T, n=A or G or C or T, m=C or A, and x≥9 (with x varying as given above). In the methods, the Solo-WCGW motif may comprise the sequence n(x−1)CWCpGWGn(x−1), and wherein W=A or T, n=A or G or C or T, and x≥9 (with x varying as given above). 
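The Solo-WCGW motif definition given above, n(x)WCpGWn(x) with the CpG being the sole CpG in the window, can be illustrated with a short scan over a DNA string. The following is a minimal sketch only, not the implementation used in the working Examples; the function name `find_solo_wcgw` and the choice to skip windows that run off the sequence ends or contain non-ACGT characters are assumptions made for illustration.

```python
from typing import List


def find_solo_wcgw(seq: str, x: int = 34) -> List[int]:
    """Return 0-based positions of the C of each Solo-WCGW CpG in `seq`.

    A CpG qualifies when it is flanked by W (A or T) on both sides and is
    the only CpG inside the window n(x) W C G W n(x), of length 2*x + 4.
    Hypothetical helper for illustration; x defaults to 34 and must be >= 9.
    """
    if x < 9:
        raise ValueError("x must be at least 9")
    seq = seq.upper()
    hits = []
    start = seq.find("CG")
    while start != -1:
        left = start - 1 - x    # first base of the left n(x) flank
        right = start + 3 + x   # one past the last base of the right flank
        if left >= 0 and right <= len(seq):
            window = seq[left:right]
            flanked_by_w = seq[start - 1] in "AT" and seq[start + 2] in "AT"
            only_acgt = set(window) <= set("ACGT")
            if flanked_by_w and only_acgt and window.count("CG") == 1:
                hits.append(start)
        start = seq.find("CG", start + 1)
    return hits


if __name__ == "__main__":
    example = "A" * 9 + "ACGT" + "A" * 9
    print(find_solo_wcgw(example, x=9))
```

Larger values of x make the motif stricter, because any second CpG within the wider window disqualifies the site.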
In the methods, the at least one PMDs may be characterized, at least in part, by late replication timing and/or nuclear lamina localization, and/or Hi-C-defined heterochromatic “compartment B”. In the methods, the at least one PMDs may be, at least in part, defined by assessing, at the data processing apparatus, the CpG dinucleotide sequence methylation data of the Solo-WCGW motif sequences (e.g., at least in part defined by assessing, at the data processing apparatus, the standard deviation (SD) of the CpG dinucleotide sequence methylation data of the Solo-WCGW motif sequences across a set of samples, or by assessing, at the data processing apparatus, the covariance between multiple Solo-WCGW motif sequences across a set of samples). In the methods, the SD of solo-WCGW PMD hypomethylation may be bimodally distributed within 100-kb bins. In the methods, the at least one PMD may be: a common PMD shared between or among a plurality of different cell or tissue types; a common PMD shared between or among normal and cancer cell or tissue types; or a common PMD shared between most healthy mammalian tissue types starting from fetal development. In the methods, the at least one PMD may be a cell-type invariant PMD, or a cell-type-specific PMD. In the methods, the replication-associated DNA methylation loss may reflect a cell-type specific replicative/mitotic turnover rate. In the methods, the cumulative number of cell divisions, or the mitotic history, may be from an early stage of embryonic development. In the methods, the replication-associated DNA methylation loss may reflect the chronological age of the cell or tissue sample. In the methods, the cell or tissue sample may be a cancer cell or cancer tissue sample. In the methods, the genomic DNA derived from a cell or tissue sample may comprise genomic DNA derived from tissue biopsies, or cell-free DNA derived from blood or other non-invasive samples including but not limited to urine, stool, saliva, etc. 
In the methods, the plurality of Solo-WCGW motif sequences of the at least one PMDs may be a number selected from at least 5, at least 10, at least 100, at least 500, at least 1,000, at least 1,500, at least 2,000, at least 5,000, and at least 10,000 or greater. In the methods, obtaining CpG dinucleotide sequence methylation data may comprise obtaining CpG dinucleotide sequence methylation data from less than a complete genomic read. In the methods, obtaining CpG dinucleotide sequence methylation data may be from the genomic DNA of a single cell. In the methods, the amount of replication-associated DNA methylation loss may vary between cell types or tissue types, reflecting a cell-type or tissue-type specific rate of replication-associated DNA methylation loss. In the methods, the plurality of Solo-WCGW motif sequences of the at least one PMDs may comprise hypomethylation prone Solo-WCGW sequence motifs selected to minimize propeller twist DNA shape. In the methods, cell-type or tissue-type specific rates of replication-associated DNA methylation loss may be used to infer the presence of one or more highly replicative cell types within a sample containing multiple cell types. The methods may, for example, comprise inferring the presence of genomic DNA of a highly replicative target cell type within a sample containing genomic DNA of multiple cell types, based on a target cell-type specific rate of replication-associated DNA methylation loss.
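The central measure described above, a mean methylation value over a plurality of PMD Solo-WCGW CpGs, optionally compared with HMD, can be sketched as follows. This is an illustrative sketch under stated assumptions: methylation is supplied as per-CpG beta values between 0 and 1 keyed by a CpG identifier, uncovered CpGs are represented as `None`, and the names `mean_beta` and `replication_associated_loss` are hypothetical.

```python
from statistics import mean
from typing import Dict, Iterable, Optional


def mean_beta(betas: Dict[str, Optional[float]], cpg_ids: Iterable[str]) -> float:
    """Mean beta value over the requested CpGs, ignoring uncovered ones."""
    values = [betas[c] for c in cpg_ids if betas.get(c) is not None]
    if not values:
        raise ValueError("no covered CpGs in the requested set")
    return mean(values)


def replication_associated_loss(
    betas: Dict[str, Optional[float]],
    pmd_solo_wcgw: Iterable[str],
    hmd_solo_wcgw: Iterable[str],
) -> Dict[str, float]:
    """Summarize PMD Solo-WCGW methylation and its deficit relative to HMD.

    The HMD-minus-PMD difference is one possible 'value related thereto';
    it is an assumption of this sketch, not a prescribed formula.
    """
    pmd = mean_beta(betas, pmd_solo_wcgw)
    hmd = mean_beta(betas, hmd_solo_wcgw)
    return {"pmd_mean": pmd, "hmd_mean": hmd, "loss": hmd - pmd}


if __name__ == "__main__":
    sample = {"a": 0.5, "b": 0.7, "c": 0.9, "d": None, "e": 0.9}
    print(replication_associated_loss(sample, ["a", "b", "d"], ["c", "e"]))
```

Averaging over many sites is what allows the measure to be computed from less than a complete genomic read, since missing sites are simply skipped.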
Additional aspects provide a method for identification of replication-associated DNA methylation loss of a target cell type in a sample containing genomic DNA of multiple cell types, comprising: a) identifying a test sample containing genomic DNA of multiple cell types including genomic DNA of a target cell type; and b) determining, at data processing apparatus, for the genomic DNA from the test sample, replication-associated DNA methylation loss according to the methods disclosed herein, wherein the at least one PMD comprises a target cell-type specific PMD to provide a measure of target cell-type specific replication-associated DNA methylation loss. In the methods, the presence of genomic DNA of the target cell may be identified at the data processing apparatus based on the presence of the target cell-type specific replication-associated DNA methylation loss. In the methods, the at least one PMD may comprise a cell-type specific PMD for the target cell type, and for each of other cell types of the sample to provide a measure of cell-type specific replication-associated DNA methylation loss for the target cell, and for each of the other cell types of the sample. In the methods, the presence of the genomic DNA of the multiple cells types may be identified at the data processing apparatus based on the presence of the respective cell-type specific replication-associated DNA methylation losses. The methods may further comprise identification at the data processing apparatus of the most hypomethylated cell types in the sample, based on the respective cell-type specific replication-associated DNA methylation losses. In the methods, the genomic DNA may comprise genomic DNA derived from tissue biopsies, or cell-free DNA derived from blood or other non-invasive samples including but not limited to urine, stool, saliva, etc.
Additional aspects provide a method for providing a measure of a mitotic history/age of a cell or tissue sample, comprising: a) identifying a test cell or tissue sample for which a determination of mitotic history/age is desired; and b) determining, at data processing apparatus, for genomic DNA from the test cell or the test tissue sample, replication-associated DNA methylation loss according to the methods described herein to provide a measure of mitotic history/age for the test cell or test tissue (test mitotic age). The methods may further comprise comparing, at the data processing apparatus, the measure of mitotic history/age of the test cell or test tissue determined in step b) with one or more control mitotic history/age values obtained, using the same method used in step b), for genomic DNA of a normal matched cell/tissue having a known replicative history, and assigning a mitotic history/age to the test cell or the test tissue. In the methods, the normal matched cell/tissue having a known replicative history may comprise a primary cell line or an immortalized primary cell line, for which mitotic history/age has been calibrated with respect to passage number using the methods disclosed herein. In the methods, the determined mitotic history/age of the cell or the tissue may be a cell type-specific or tissue type-specific mitotic history/age.
Additional aspects provide a method for determining a chronological age of a cell or tissue sample, comprising: a) identifying a test cell or tissue sample for which a determination of chronological age is desired; b) determining, at data processing apparatus, for genomic DNA from the test cell or the test tissue sample, replication-associated DNA methylation loss according to the methods disclosed herein to provide a measure of mitotic history/age for the test cell or test tissue (test mitotic age); and c) determining a chronological age for the test cell or test tissue by comparing, at the data processing apparatus, the test mitotic age with one or more control mitotic age values obtained, using the same method used in a), for genomic DNA of a normal, cell-matched and/or tissue-matched control population calculated, at the data processing apparatus, over a chronological age range, and assigning a chronological age to the test cell or the test tissue. In the methods, the actual chronological age of the test cell or test sample may be known and may be less than the chronological age determined in step b), providing a measure of accelerated aging. The methods may be part of a forensic analysis.
Additional aspects provide a method for determining increased risk for conditions associated with excessive replicative turnover or aging, comprising: a) identifying a test cell or tissue sample for which a determination of increased risk for conditions associated with excessive replicative turnover or aging is desired; b) measuring, at data processing apparatus, for genomic DNA from the test cell or the test tissue sample having a known chronological age, replication-associated DNA methylation loss according to the methods disclosed herein to provide a measure of mitotic age for the test cell or test tissue (test mitotic age); and c) determining that there is an increased risk for conditions associated with excessive replicative turnover or aging by comparing, at the data processing apparatus, the test mitotic age with control mitotic age values obtained, using the same method used in a), for the genomic DNA of a normal, cell-matched or tissue-matched control population having the same chronological age as the test cell or test tissue, and finding, at the data processing apparatus, that the test mitotic age is greater than the age-matched control mitotic age. In the methods, the condition associated with excessive replicative turnover or aging may be selected from the group consisting of cancer, neurodegenerative disease, cardiovascular disease, gastrointestinal disease, auto-immune diseases, and progeria.
Additional aspects provide a method for determining increased risk of a subject for conditions associated with excessive replicative turnover or aging, comprising: a) determining, at data processing apparatus, replication-associated genomic DNA methylation loss for a test cell or test tissue of a test subject; and b) comparing, at the data processing apparatus, the replication-associated genomic DNA methylation loss determined in a) with that of an age-matched normal control cell or tissue; and c) based on the comparison in part b), concluding, at the data processing apparatus, that a subject having greater replication-associated genomic DNA methylation loss compared to that of the age-matched control is a subject having an increased risk for conditions associated with excessive replicative turnover or aging, wherein the replication-associated genomic DNA methylation loss is determined by the methods disclosed herein. In the methods, the condition associated with excessive replicative turnover or aging may be selected from the group consisting of cancer, neurodegenerative disease, cardiovascular disease, gastrointestinal disease, auto-immune diseases and progeria.
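The comparison against an age-matched control described above can be sketched as a simple check of the test value against the control distribution. This is a minimal sketch under stated assumptions: the loss values are any consistent measure of replication-associated methylation loss, the function name `compare_to_age_matched` is hypothetical, and the z-score is an added convenience for judging the size of the difference rather than a criterion stated in the methods.

```python
from statistics import mean, stdev
from typing import Dict, Optional, Sequence, Union


def compare_to_age_matched(
    test_loss: float, control_losses: Sequence[float]
) -> Dict[str, Union[float, bool, None]]:
    """Compare a test sample's methylation loss with age-matched controls.

    Returns the control mean, whether the test value exceeds it, and a
    z-score (None when the controls show no spread).
    """
    if len(control_losses) < 2:
        raise ValueError("need at least two control values")
    mu = mean(control_losses)
    sd = stdev(control_losses)
    z: Optional[float] = (test_loss - mu) / sd if sd > 0 else None
    return {"control_mean": mu, "greater_than_control": test_loss > mu, "z": z}


if __name__ == "__main__":
    controls = [0.10, 0.12, 0.11, 0.13, 0.09]
    print(compare_to_age_matched(0.30, controls))
```

In practice the controls would be cell-matched or tissue-matched as well as age-matched, since the rate of loss differs between cell types.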
Yet additional aspects provide a method of assessing methylation maintenance in stem cells, comprising: identifying a test stem cell sample; determining, at data processing apparatus, a measure of replication-associated genomic DNA methylation loss by the method disclosed herein; and based on the measure of replication-associated genomic DNA methylation loss, concluding, at the data processing apparatus, the degree of methylation maintenance by comparison with a normal control stem cell methylation value. In the methods, the stem cell may be selected from the group consisting of embryonic stem cells (ESC), induced pluripotent stem cells (iPSC) and mesenchymal stem cells (MSCs).
Further aspects provide a method for structurally defining a partially methylated domain (PMD) of genomic DNA, comprising: a) identifying a genomic DNA for which at least one PMD structural determination is desired; b) obtaining, at the data processing apparatus, CpG dinucleotide sequence methylation data for the genomic DNA, wherein each such CpG dinucleotide is the sole CpG dinucleotide sequence within a n(x)WCpGWn(x) genomic DNA sequence motif (Solo-WCGW motif) of at least one PMD, and wherein W=A or T, n=A or G or C or T, and x≥9 (with x varying as given above for the general methods); and c) determining, at the data processing apparatus, a PMD structure based on the CpG dinucleotide sequence methylation data. In the methods, the at least one PMD may be, at least in part, defined by assessing, at the data processing apparatus, the standard deviation (SD) of the CpG dinucleotide sequence methylation data of the Solo-WCGW motif sequences. In the methods, the SD of solo-WCGW PMD hypomethylation may be bimodally distributed within 100-kb bins.
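The SD-based definition of PMD structure can be sketched as a per-bin calculation across samples. This is an illustrative sketch only: it assumes that the mean Solo-WCGW methylation of each 100-kb bin has already been computed per sample, that bins in the higher-SD mode of the bimodal distribution correspond to PMDs, and that a cutoff between the two modes is supplied by the user. The names `bin_sd_across_samples` and `call_pmd_bins` and the parameter `sd_cutoff` are hypothetical, and no particular cutoff value is implied by the disclosure.

```python
from statistics import stdev
from typing import Dict, Sequence


def bin_sd_across_samples(bin_means: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """SD across samples of the mean Solo-WCGW beta value in each bin.

    `bin_means` maps a bin identifier (e.g. "chr1:0") to one value per sample.
    Bins with fewer than two samples are skipped.
    """
    return {b: stdev(v) for b, v in bin_means.items() if len(v) >= 2}


def call_pmd_bins(bin_means: Dict[str, Sequence[float]], sd_cutoff: float) -> Dict[str, str]:
    """Label each bin 'PMD' (SD at or above the cutoff) or 'HMD' (below it)."""
    sds = bin_sd_across_samples(bin_means)
    return {b: ("PMD" if sd >= sd_cutoff else "HMD") for b, sd in sds.items()}


if __name__ == "__main__":
    bins = {
        "chr1:0": [0.90, 0.88, 0.91, 0.90],       # stable across samples
        "chr1:100000": [0.80, 0.50, 0.30, 0.65],  # variable across samples
    }
    print(call_pmd_bins(bins, sd_cutoff=0.1))
```

The cutoff would ordinarily be chosen by inspecting the bimodal SD distribution of the sample set at hand.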
Yet further aspects provide a method for developing a mitotic clock, including: (a) identifying a test cell for which a determination of a mitotic clock is desired; (b) providing conditions for the test cell to divide; (c) determining the number of effective cell divisions in the test cell at one or more timepoints; (d) obtaining, at data processing apparatus, CpG dinucleotide sequence methylation data for genomic DNA derived from the test cell at the timepoints, wherein the genomic DNA comprises highly methylated domains (HMD) and partially methylated domains (PMD), wherein each such CpG dinucleotide is the sole CpG dinucleotide sequence within a n(x)WCpGWn(x) genomic DNA sequence motif (Solo-WCGW motif) of at least one PMD, and wherein W=A or T, n=A or G or C or T, and x≥9; (e) based on the CpG dinucleotide sequence methylation data, determining, at the data processing apparatus, a mean or average CpG dinucleotide methylation value or a value related thereto at each of the timepoints for a plurality of Solo-WCGW motif sequences of the at least one PMDs, to provide a measure of cellular replication-associated DNA methylation loss at each of the timepoints; (f) correlating, at the data processing apparatus, the effective cell divisions at each of the timepoints with the measure of cellular replication-associated DNA methylation loss at each of the timepoints; and (g) if the correlation from the correlating step is statistically significant, identifying the measure of cellular replication-associated DNA methylation loss as a mitotic clock.
In additional aspects, the correlating step may include calculating regression at the data processing apparatus and, for example, the regression calculation may be determined by an elastic net regression model or an independent regression model.
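The correlating step and the significance check of the mitotic clock method can be sketched as follows. This is a minimal sketch, not the regression used in the working Examples: it substitutes an ordinary least-squares line and an exact permutation test on the Pearson correlation for the elastic net or independent regression models mentioned above, and the function names and example numbers are hypothetical.

```python
from itertools import permutations
from math import sqrt
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def exact_permutation_p(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided exact permutation p-value for |r|.

    Enumerates every ordering of y, so it is only practical for a handful
    of timepoints (roughly eight or fewer).
    """
    observed = abs(pearson_r(x, y))
    perms = list(permutations(y))
    hits = sum(1 for p in perms if abs(pearson_r(x, p)) >= observed - 1e-12)
    return hits / len(perms)


if __name__ == "__main__":
    divisions = [0, 10, 20, 30, 40, 50]               # hypothetical effective divisions
    pmd_beta = [0.80, 0.76, 0.71, 0.68, 0.63, 0.60]   # hypothetical PMD Solo-WCGW means
    slope, intercept = fit_line(divisions, pmd_beta)
    p = exact_permutation_p(divisions, pmd_beta)
    print(slope, intercept, p, "clock candidate" if p < 0.05 else "not significant")
```

A negative slope with a significant correlation corresponds to progressive methylation loss with accumulated divisions; the fitted line can then serve as a passage-number calibration curve.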
In yet further aspects, each of the one or more timepoints may be a cell passage in vitro or changes (e.g., increases) of a cell mass in vivo. In one aspect, the conditions for the division of the test cell may include passing the test cell to certain passage numbers, wherein the timepoints are the passage numbers.
In an additional aspect, the method may include extracting DNA at each passage number and performing bisulfite conversion and library preparation and/or, at the data processing apparatus, determining a passage number calibration curve.
Further, in one aspect, the determining step may include measuring the volume of the cell mass at the one or more timepoints, wherein a change (e.g., an increase) in the volume of the cell mass across the timepoints reflects an increase in the number of effective cell divisions.
This patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
FIGS. 3A1-3A2, 3B-E show, according to particular exemplary aspects, that most PMDs are shared across developmental lineages in humans.
FIGS. 10A1-10A3, 10B1-10B2 show, according to particular exemplary aspects, that the same sequence dependencies shown in
According to particular surprising aspects of the present invention, four distinct features were identified that influence DNA methylation levels in large portions of the human and mouse genomes: First, the local sequence context of the CpG dinucleotide; second, the timing of DNA replication; third, the presence of the H3K36me3 histone mark; and fourth, the accumulated number of cell divisions.
According to additional aspects, the sequence context, replication timing, and H3K36me3 marks each confer differential susceptibility to replication-associated DNA methylation loss, and thus collectively shape PMD/HMD structure, while the degree of PMD hypomethylation is a function of the cumulative number of cell divisions from the earliest stages of embryonic development.
According to particular aspects, two local sequence features (CpG density and the WCGW sequence context) were shown to exert a strong influence on the rate of DNA methylation loss at individual CpGs within PMDs, and these influences were shown to be consistent across cell types and species.
The bulk of DNA methylation maintenance is performed by DNMT1 and augmented by DNMT3A/B (48). DNMT1 has been shown to act processively, with increased efficiency in the presence of multiple CpG sites in close proximity (49), a feature consistent with the poorer methylation maintenance of “solo” CpGs (
According to additional aspects, the in vivo effects of a WCGW motif disclosed herein on methylation maintenance efficiency provide for careful mechanistic studies to identify the causative factor or factors.
According to further aspects, the Solo-WCGW signature, developed and disclosed herein, allowed for the improved analysis of HMD/PMD structure (and the shared PMD signatures) also disclosed herein, leading to better characterization of not just the “common PMDs” disclosed here, but also important classes of cell-type-specific PMDs (6, 7, 14, 52) (see working Example 10 below).
According to additional aspects of the present invention, most Solo-WCGW are not marked by H3K36me3, and replication timing was identified as the major determinant for methylation levels at these H3K36me3-negative CpGs. According to certain aspects, and while not being bound by mechanism, replication late in S phase provides the cell with less time for re-methylation of newly synthesized daughter strands during DNA replication (
According to yet additional aspects of the present invention, the presence of H3K36me3 overrides this late-replication associated methylation loss at Solo-WCGW CpGs (
A number of studies have identified specific CpGs predictive of chronological age (58-60) as well as gestational age at birth (61). However, these signatures are largely non-overlapping with PMDs, as shown in earlier work (26) and with the PMD solo-WCGWs identified here. According to particular aspects of the present invention, this is because the presently disclosed PMD hypomethylation captures underlying mitotic dynamics, which are only loosely associated with chronological age per se. Organismal aging and the associated physiological changes affect transcriptional regulation of various genes and pathways, and many or most of the loci identified on the basis of age alone (58-60) likely represent transcriptionally-coupled chromatin changes at these genes (for example, changes to Somatostatin, which regulates growth hormone (58)). According to particular aspects, as shown herein, PMD hypomethylation is likely a more direct clock-like readout of mitotic age, which is generally correlated with chronological age but can be accelerated by environmental factors or processes that promote cell turnover, such as cellular damage, wounding, inflammation, etc.
DNA hypomethylation has long been proposed to allow the aberrant expression and transposition of retroelements that can play a role in cancer by inducing chromosomal aberrations at the point of insertion (62-66). Genetically engineered Dnmt1 hypomorphism in mouse was shown to cause lymphomas frequently harboring retrotransposon-induced Notch1 activation events (43). Whole-genome sequencing has shown that approximately 50% of human tumors contain somatic retrotranspositions of LINE-1 elements, and that these often lead to structural alterations (39, 40, 67, 68) enriched within PMDs (39). In one study, human lung tumors exhibiting mobilization of LINE-1 elements shared a common DNA hypomethylation signature (42).
According to additional aspects of the present invention, as shown herein across a large TCGA cohort, tumors with higher degrees of PMD hypomethylation are more likely to have LINE-1 insertions, and these insertions are more likely to occur within PMDs (
The methylation loss process described and disclosed herein affects a sizeable fraction of all CpGs in the genome, and thus could exert a significant influence on methylation-dependent mutational processes, most importantly CpG to TpG substitutions driven by methylation-dependent deamination of CpGs. This mutational signature accounts for a large fraction of single nucleotide mutations observed in both evolution and cancer, and thus systematic DNA methylation changes might be expected to influence the rate of these mutations. According to particular aspects, hypomethylated solo-WCGWs within late replicating PMDs are protected from deamination and thus have a lower CpG to TpG mutation rate. Indeed, evidence in support of this model was observed herein for both somatic mutations (from tumor sequencing) and de novo mutations in the human germline (from whole-genome trio sequencing) (
According to particular aspects, working Example 1 below describes the definition and use of a Solo-WCGW sequence motif having substantial utility for measuring genomic DNA methylation loss. Solo-WCGW CpGs were shown herein to be prone to hypomethylation. A set of shared partially methylated domains (PMDs) and highly methylated domains (HMDs) was initially defined across the majority of a 49 core sample set (40 core tumor samples and 9 core normal samples) (
According to additional aspects, working Example 2 below describes data showing that most PMDs are shared across cancer and normal tissues. Genome-wide, the standard deviation (SD) of solo-WCGW PMD hypomethylation was bimodally distributed within 100-kb bins in both normal and tumor core groups (
According to additional aspects, working Example 3 below describes data showing that most PMDs are shared across developmental lineages. The findings support the idea, according to particular aspects of the present invention, that a large set of cell-type-invariant PMDs dominates the hypomethylation landscape in most tissues.
According to additional aspects, working Example 4 below describes data showing that PMD hypomethylation emerges during embryonic development. The substantial similarity of PMD structure detected between ICMs, ESCs, embryonic (<8 weeks) stages, and post-natal samples suggests that PMD hypomethylation begins at the earliest stages of development. This interpretation is strengthened by the observation that the degree of hypomethylation observed at the fetal and postnatal stages for each cell type largely mirrors the lineage-specific hypomethylation rate within the same embryonic cell type.
According to additional aspects, working Example 5 below describes data showing that PMD hypomethylation is associated with chronological age. A strong age association was evident from the WGBS profile of sorted CD4+ T cells from a newborn vs. those from a 103-year-old individual, with the latter being closer to a T cell-derived leukemia than to the newborn sample (
According to additional aspects, working Example 6 below describes data showing that PMD hypomethylation is linked to mitotic cell division in cancer. PMD hypomethylation was nearly universal but showed extensive variation both within and across cancer types. Comparison to 749 adjacent normals from TCGA showed that the relative degree of hypomethylation across cancer types was correlated with that of the disease-free tissue of origin (
According to additional aspects, working Example 7 below describes data showing that both replication timing and H3K36me3 affect methylation. IMR90 cells, for which there are publicly available data for all relevant histone and topological marks, were used to systematically analyze the presently disclosed solo-WCGW based PMD definition. This analysis confirmed that HMD/PMD structure coincided with nuclear architecture, as characterized by Hi-C A/B compartments, Lamin B1 distribution and replication timing (
According to additional aspects, working Example 8 below describes the materials and methods used in the presently disclosed work, including whole genome bisulfite sequencing, external data, alignment and extraction of methyl-cytosine levels, genomic binning, definition of preliminary PMD/HMD domains, final definition of PMDs/HMDs based on standard deviation of solo-WCGW methylation, HM450 analysis, analysis of the IMR90 epigenome, rescaling based on PMD methylation, stratified analysis of solo-WCGW CpGs in the genome, statistics, data availability, code availability, and URLs.
According to additional aspects, working Example 9 below describes data showing that PMD hypomethylation in immortalized cell lines was demonstrated using the solo-WCGW motif. PMD hypomethylation was observed in almost all cultured cell lines except for ESCs, iPSCs and their derived cell lines (
According to additional aspects, working Example 10 below describes data showing that improved analysis of HMD/PMD structure was obtained using the solo-WCGW motif. Cell-type invariant PMDs were useful for investigating general properties of methylation loss over time. PMDs were defined in the present work by exploiting the inherent variance in PMD hypomethylation levels across large cohorts of samples, which was the only cross-sample feature bimodally distributed between HMDs and PMDs. Under this definition, for example, the core tumor group (containing only solid tumors) had almost the same degree of shared PMDs with blood malignancies (82%) as it did with other solid tumors not from the core set (85%) (
According to additional aspects, working Example 11 below describes data showing that the stability of rank-based correlation between methylomes was demonstrated using the solo-WCGW motif. A rank-based analysis of 792 genomic 100 kb bins from chromosome 16 (
According to additional aspects, working Example 12 below discusses an alternative nuclear localization model (
According to additional aspects, working Example 13 below assesses the relevance of the PMD sequence signature to somatic and germline mutational landscape.
To investigate any potential impact of the PMD sequence signature on introducing cytosine deamination mutations in the CpG dinucleotides, the relative proportion of somatic mutations that are within certain tetranucleotide sequence contexts and certain numbers of neighboring CpGs was studied. Somatic CpG to TpG mutations reported in an early gastric cancer whole-genome sequencing experiment were compared, and it was indeed confirmed that solo-WCGWs within late replicating PMDs had a lower CpG to TpG mutation rate compared with other sequence contexts (
According to additional aspects, as described in working Example 14 below, certain specific sub-patterns that match the Solo-WCGW definition were found to be more predictive than the general definition, and DNA shape features were also found to be predictive. According to additional aspects, therefore, more specific definitions and structures within the general Solo-WCGW pattern are provided for tracking replication-associated DNA methylation loss.
According to additional aspects, working Example 15 below describes the materials and methods used in the presently disclosed Examples 16-18, including primary cell culture, DNA methylation assay, Beta calling, QA/NA Removal, and Solo-WCGW subsetting.
According to additional aspects, working Example 16 below describes using an elastic net modeling strategy to identify a 44 CpG model for predicting mitotic history within and between cell types.
According to additional aspects, working Example 17 below describes using an individual probe regression strategy to identify 75 correlated probes for all tissue types studied.
According to additional aspects, working Example 18 below describes a comparison of the results of using the elastic net modeling strategy and the individual probe regression strategy.
According to additional aspects, working Example 19 below describes a comparison of the solo-WCGW mitotic clock to existing clocks, including conception, model building and application.
According to additional aspects, as described in working Example 20 below, the disclosed methods for measuring and tracking replication-associated DNA methylation loss are broadly applicable, and additional, non-limiting exemplary applications are provided.
Unless otherwise explained, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is to be understood that the methods and systems are not limited to specific methods, specific components, or to particular compositions. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.
As used in the specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Ranges may be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms another embodiment. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint.
“Optional” or “optionally” means that the subsequently described event or circumstance may or may not occur, and that the description includes instances where said event or circumstance occurs and instances where it does not. “On the order of” can mean approximately, a fraction thereof, or a multiple thereof.
Throughout the description and claims of this specification, the word “comprise” and variations of the word, such as “comprising” and “comprises,” means “including but not limited to,” and is not intended to exclude, for example, other additives, components, integers or steps. “Exemplary” means “an example of” and is not intended to convey an indication of a preferred or ideal embodiment. “Such as” is not used in a restrictive sense, but for explanatory purposes.
Ranges can be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, another aspect includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms another aspect. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint. It is also understood that there are a number of values disclosed herein, and that each value is also herein disclosed as “about” that particular value in addition to the value itself. For example, if the value “10” is disclosed, then “about 10” is also disclosed. It is also understood that each unit between two particular units is also disclosed. For example, if 10 and 15 are disclosed, then 11, 12, 13, and 14 are also disclosed. All ranges disclosed herein are inclusive and combinable (e.g., ranges of “up to 25%, or, more specifically 5% to 20%” are inclusive of the endpoints and all intermediate values of the ranges of “5% to 25%,” etc.).
The terms “first,” “second,” “first part,” “second part,” and the like, where used herein, do not denote any order, quantity, or importance, and are used to distinguish one element from another, unless specifically stated otherwise.
As used herein, the terms “optional” or “optionally” mean that the subsequently described event or circumstance can or cannot occur, and that the description includes instances where said event or circumstance occurs and instances where it does not.
The sequence “WCGW” as used herein refers to a CpG dinucleotide sequence flanked by either A or T (e.g., ACGA, ACGT, TCGT, TCGA). According to particular aspects of the present invention, preferred WCGW sequences are those located in sequence motifs (e.g., ≥22 bp) characterized by specific G/C content and/or having only one or a few CpG dinucleotides. For example, preferred aspects of the present methods comprise determining a mean or average methylation value, or a value related thereto, for a plurality of genomic CpG dinucleotide sequences, wherein each such CpG dinucleotide is the sole CpG dinucleotide sequence within a n(x)WCpGWn(x) genomic DNA sequence motif, wherein W=A or T, n=A or G or C or T, and wherein x≥9, to provide a measure of cellular replication-associated DNA methylation loss. In preferred aspects, x is a value selected from the group consisting of at least 9, at least 14, at least 19, at least 24, at least 29, at least 34, at least 39, at least 44, at least 49, at least 54, at least 59, about 34, 34±25, 34±15, or x is a value in a range selected from the group consisting of about 9-49, 9-99, 9-149, 9-199, 14-49, 14-99, 14-149, 14-199, 19-49, 19-99, 19-149, 19-199, 24-49, 24-99, 24-149, 24-199, 29-49, 29-99, 29-149, 29-199, 34-49, 34-99, 34-149, 34-199, 39-49, 39-99, 39-149, 39-199, 44-49, 44-99, 44-149, 44-199, 49-99, 49-149, 49-199, 54-99, 54-149, 54-199, 59-99, 59-149, 59-199 and any subranges of the preceding ranges. Preferably, x is 34 (or about 34), or 34±25 (e.g., in the range of 9-59) or 34±15 (e.g., in the range of 19-49).
“Solo-WCGW” refers to a n(x)WCpGWn(x) genomic DNA sequence motif wherein the CpG dinucleotide of the WCGW sequence is the sole CpG dinucleotide sequence in the n(x)WCpGWn(x) genomic DNA sequence motif, wherein W, n and x are defined as in the preceding paragraph. Preferred solo-WCGW genomic DNA sequence motifs are those wherein x is 34 (or about 34), or 34±15 (e.g., in the range of 19-49); however, less favored aspects of the methods may include x in a value range selected from 9 to 199 as described in the preceding paragraph.
In particular aspects, the Solo-WCGW motif may comprise the sequence n(x−1)mWCpGWGn(x−1), wherein W=A or T, n=A or G or C or T, m=C or A, and x≥9 (with x varying as described above in the preceding paragraphs). In the methods, the Solo-WCGW motif may comprise the sequence n(x−1)CWCpGWGn(x−1), wherein W=A or T, n=A or G or C or T, and x≥9 (with x varying as described above in the preceding paragraphs).
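By way of non-limiting illustration, the Solo-WCGW definition above can be sketched as a simple sequence scan. The function name `find_solo_wcgw`, its parameters, and the decision to skip windows that contain ambiguous bases or run off the end of the sequence are assumptions made for this sketch only; they are not taken from the disclosed Examples.

```python
import re

def find_solo_wcgw(seq: str, x: int = 34) -> list[int]:
    """Return 0-based positions of the cytosine of each solo-WCGW CpG.

    A CpG qualifies when it lies in a WCGW tetranucleotide (W = A or T)
    and is the only CpG in the window n(x)WCGWn(x), i.e. 2x + 4 bases.
    (Hypothetical helper for illustration.)
    """
    seq = seq.upper()
    hits = []
    # The lookahead allows overlapping tetranucleotide candidates to be visited.
    for m in re.finditer(r"(?=[AT]CG[AT])", seq):
        start = m.start() - x        # first base of the left flank
        end = m.start() + 4 + x      # one past the last base of the right flank
        if start < 0 or end > len(seq):
            continue                 # incomplete flank: skip (assumption)
        window = seq[start:end]
        if set(window) - set("ACGT"):
            continue                 # ambiguous bases such as N: skip (assumption)
        if window.count("CG") == 1:  # the WCGW CpG is the sole CpG in the window
            hits.append(m.start() + 1)
    return hits

# Usage: one qualifying CpG with x = 34, none once a second CpG enters the flank.
solo = "A" * 34 + "ACGT" + "T" * 34
crowded = "A" * 30 + "CGAA" + "ACGT" + "T" * 34
print(find_solo_wcgw(solo), find_solo_wcgw(crowded))
```

Smaller or larger values of x within the ranges recited above can be passed through the `x` argument.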
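As a hedged illustration of the narrower motifs just recited, the six-base core of a window can be compared against the mWCGWG and CWCGWG patterns. The function `classify_core` and its return labels are hypothetical names chosen for this sketch; the sketch assumes the caller supplies a full window of 2x+4 bases and does not itself verify that the CpG is the sole CpG in that window.

```python
import re

CORE_C = re.compile(r"C[AT]CG[AT]G")     # C W C G W G
CORE_M = re.compile(r"[CA][AT]CG[AT]G")  # m W C G W G, with m = C or A
CORE_W = re.compile(r"[AT]CG[AT]")       # W C G W

def classify_core(window: str, x: int) -> str:
    """Label a 2x+4 base window by the most specific core pattern it matches."""
    window = window.upper()
    if len(window) != 2 * x + 4:
        raise ValueError("window must span exactly 2x + 4 bases")
    # Bases x-1 .. x+4: the base before the first W through the base after the second W.
    core = window[x - 1 : x + 5]
    if CORE_C.fullmatch(core):
        return "CWCGWG"
    if CORE_M.fullmatch(core):
        return "mWCGWG"
    if CORE_W.fullmatch(core[1:5]):
        return "WCGW"
    return "none"

# Usage with x = 9 (a 22-base window).
print(classify_core("T" * 8 + "CACGTG" + "A" * 8, x=9))
```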
Exemplary human and mouse n(x)WCpGWn(x) genomic DNA sequence motif species are provided in Tables 4-7 below.
In particular, less favored, aspects of the methods, the n(x)WCpGWn(x) genomic DNA sequence motif may comprise 1 or 2 CpG dinucleotide sequences in addition to the CpG dinucleotide sequence of the WCGW sequence. In such aspects, x is a value selected from the group consisting of at least 9, at least 14, at least 19, at least 24, at least 29, at least 34, at least 39, at least 44, at least 49, at least 54, at least 59, about 34, 34±25, 34±15, or x is a value in a range selected from the group consisting of about 9-49, 9-99, 9-149, 9-199, 14-49, 14-99, 14-149, 14-199, 19-49, 19-99, 19-149, 19-199, 24-49, 24-99, 24-149, 24-199, 29-49, 29-99, 29-149, 29-199, 34-49, 34-99, 34-149, 34-199, 39-49, 39-99, 39-149, 39-199, 44-49, 44-99, 44-149, 44-199, 49-99, 49-149, 49-199, 54-99, 54-149, 54-199, 59-99, 59-149, 59-199 and any ranges or subranges of the preceding ranges. In particular of such aspects, x is 34 (or about 34), or 34±25 (e.g., in the range of 9-59) or 34±15 (e.g., in the range of 19-49).
For purposes of the presently disclosed methods, in the context of the various above-described n(x)WCpGWn(x) genomic DNA sequence motifs, certain instances of the motif are more predictive (e.g., for tracking replication-associated DNA methylation loss) than others. In our analysis, Solo-WCGWs (as described above) in the contexts ACGA, TCGA, and ACGT are not equally predictive for tracking replication-associated DNA methylation loss.
As used herein, “condition or state” of a test cell or tissue sample means the health of a cell or tissue, including, for example, the condition or state of a normal (healthy) cell or tissue, a diseased cell or tissue, and/or a cell or tissue showing some signs indicative of a diseased state. In one example, the condition or state includes signs indicative of the beginning of a diseased state and/or the progression or advancement towards a diseased state. The “condition or state” of a test cell or tissue sample also includes the type of cell or tissue, for example, the developmental stage of a particular cell or tissue type (embryonic, fetal, neonatal, adult), and the differentiated type of cell or tissue, for example, a liver cell, lung cell, or brain cell.
As used herein, the term “effective cell division” or “effective cell divisions” means the process of dividing a parent cell into two new identical daughter cells, each daughter cell including the same number of chromosomes and genetic content as that of the parent cell. In one aspect, effective cell division may refer to the number of nuclear divisions when a eukaryotic cell reproduces during maintenance or growth.
As used herein, “determining the number of effective cell divisions” means determining the number of cells present after effective cell division(s). In one aspect, in the in vitro environment, the number of cells present after division(s) of a test cell can be determined by serially measuring the growth of the cell culture with a count slide (or hemacytometer) and a microscope, or with a spectrophotometer. In another aspect, stains are used to distinguish viable from non-viable cells to account for rates of cell death.
In one aspect, as used with Examples 15-18 below, the number of effective cell divisions may be determined according to the following methods. Primary cells are maintained under pro-mitotic conditions using optimal media formulations as recommended by the vendor (Coriell). The neonatal fibroblast lines (AG21859, AG21839) are cultured in 1:1 Ham's F12: Dulbecco's Modified Eagle's Medium, with 2 mM L-glutamine, 15% v/v fetal bovine serum (FBS), and 1% v/v penicillin-streptomycin. The adult fibroblast line (AG16146) is cultured in Eagle's Minimum Essential Medium with Earle's salts, 1% v/v non-essential amino acids, 10% FBS v/v, and 1% v/v penicillin-streptomycin. The adult vascular smooth muscle line (AG21546) is cultured in Medium 199 in Earle's BSS, with 2 mM L-glutamine, 10% FBS v/v, 0.02 mg/ml Endothelial Cell Growth Supplement, 0.05 mg/ml Heparin, and 1% v/v penicillin-streptomycin. Culture dishes are first coated with sterile gelatin (0.1% w/v) before seeding; this facilitates attachment and growth. The adult endothelial line (AG11182) is cultured under identical conditions to the vascular smooth muscle cell line (AG11546) except that 15% v/v FBS is included. All primary cell lines are maintained at 37° C. at 5% CO2. Media is aspirated and replaced every 2-3 days. Replicative senescence is defined qualitatively as the inability to reach confluence at two weeks following the most recent passaging event, or >60% non-viable cells as quantified below.
Cells are counted using an automated cell counter (BioRad TC20). Briefly, 10 µl of a suspension of cells is retained at each passage. An equal volume (10 µl) of 0.40% Trypan Blue Dye is added to and gently mixed with the cell suspension. The addition of Trypan Blue Dye allows for detection of the live/dead cell fraction; dead cells are stained and live cells are not. Ten microliters of the stained cell suspension is applied to each chamber of a double-sided hemocytometer/counting slide. Both sides are read by the automated cell counter and the average live and dead cell counts are calculated.
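The averaging and live/dead arithmetic described above can be sketched as follows. This is an illustrative sketch only: the function names and the input layout (one `(live, dead)` pair per chamber) are assumptions, and the 60% figure is the non-viable threshold stated in the senescence definition above. Because live and dead cells are diluted equally by the dye, the 1:1 dilution does not change the non-viable fraction.

```python
def summarize_counts(chamber_counts: list[tuple[float, float]]) -> dict[str, float]:
    """Average (live, dead) counts across chambers and derive the non-viable fraction."""
    if not chamber_counts:
        raise ValueError("at least one chamber reading is required")
    n = len(chamber_counts)
    mean_live = sum(live for live, _ in chamber_counts) / n
    mean_dead = sum(dead for _, dead in chamber_counts) / n
    total = mean_live + mean_dead
    if total == 0:
        raise ValueError("no cells counted")
    return {
        "mean_live": mean_live,
        "mean_dead": mean_dead,
        "nonviable_fraction": mean_dead / total,
    }

def exceeds_nonviable_threshold(nonviable_fraction: float, threshold: float = 0.60) -> bool:
    """True when the non-viable fraction is above the stated >60% criterion."""
    return nonviable_fraction > threshold

# Usage with two hypothetical chamber readings.
summary = summarize_counts([(80, 20), (60, 40)])
print(summary, exceeds_nonviable_threshold(summary["nonviable_fraction"]))
```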
Population doubling level (PDL) is a standard method for quantifying mitoses within a population, given the initial seeding density and the final cell count at harvest. PDL for a given passage is calculated as follows:
This is a derivative equation of the binary fission equation: $x = 2^{n}$, wherein x=final cell count and n=number of population doublings. The multiplier 3.32 is introduced by converting from
To calculate the total mitotic history, the sum of total PDLs (from passage 1 onward) is taken:
$$\text{Total PDL} = \sum_{\text{passage } 1}^{\text{passage } n} \text{PDL}$$
The vendor (Coriell) may provide a starting PDL for primary cell lines that are established in their facilities; this is also included in the cumulative PDL.
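The per-passage formula is not reproduced in the text above, so the following is a minimal sketch under a stated assumption: that the per-passage PDL takes the commonly used form 3.32 × (log10(final count) − log10(initial count)), which equals the base-2 logarithm of the fold expansion up to rounding of the multiplier (1/log10(2) ≈ 3.32). The function names and the `(initial, final)` input layout are hypothetical.

```python
import math

def passage_pdl(initial_count: float, final_count: float) -> float:
    """Population doublings in one passage: log2(final / initial).

    Assumed equivalent to 3.32 * (log10(final) - log10(initial)).
    """
    if initial_count <= 0 or final_count <= 0:
        raise ValueError("cell counts must be positive")
    return math.log2(final_count / initial_count)

def cumulative_pdl(passages: list[tuple[float, float]], starting_pdl: float = 0.0) -> float:
    """Sum per-passage PDLs from passage 1 onward, plus any vendor-provided starting PDL."""
    return starting_pdl + sum(passage_pdl(i, f) for i, f in passages)

# Usage: two passages seeded at 1e5 cells, harvested at 4e5 and 8e5 cells,
# with a hypothetical vendor starting PDL of 10.
print(cumulative_pdl([(1e5, 4e5), (1e5, 8e5)], starting_pdl=10.0))

# The rounded 3.32 multiplier gives nearly the same per-passage value.
approx = 3.32 * (math.log10(8e5) - math.log10(1e5))
print(passage_pdl(1e5, 8e5), approx)
```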
In another aspect, in an in vivo environment, the number of cells present after cell division(s) can be determined by serially measuring the change in volume of a cell mass of a test cell or cells, or test cell tissue that has been grafted onto the animal, e.g., a mouse or other rodent.
As used herein “conditions for the test cell to divide” means conditions for effective cell division; and such conditions can be provided either in an in vitro environment or an in vivo environment. In vitro, in one embodiment, the conditions for a test cell to divide may include a culture plate containing a solid or liquid media or agar. In one aspect, conditions for encouraging a test cell to divide in vitro in the media/agar include providing a nutrient-rich broth in the media/agar along with, in some instances, antibiotics to promote cell growth; and providing temperature conditions favorable for cell growth (for example, 37° C.). In vivo, in one embodiment, the conditions for a test cell to divide may include providing an animal (e.g., a mouse, rat, or other animal) and grafting one or more test cells, or cell tissue, onto the animal. In one aspect, conditions for encouraging a test cell to divide in vivo include providing food, water and nutrients to the animal and, in some instances, antibiotics to promote growth of the animal; and temperature conditions favorable for growth of the animal (for example, 23° C.).
As used herein, “cell passaging” or “passaging” is a process for subculturing cells under physiological and environmental conditions to keep the cells alive for periods of time, sometimes extended periods of time. And as used herein, “passage number” or “cell passage” means the number of times a cell culture has been subcultured (harvested and transferred) into daughter cell cultures.
As used herein, “timepoint” or “timepoints” means the moment in time when a particular action occurs, for example, the transfer of cells to a new cell culture plate in cell passaging.
In one aspect, the methods described herein provide statistical methods to estimate the probability of a degree of association between variables, and statistical significance can be expressed in terms of a p-value. As used herein, in one aspect, “statistically significant” means a p-value that is less than 0.05 or, alternatively, less than 0.01, 0.005, or 0.001.
The term “mitotic clock” means a series of similar events which occur in a DNA replication-dependent manner. One example of a mitotic clock is the loss of a small amount of DNA following each round of DNA replication due to the inability of DNA polymerase to fully replicate chromosome ends (telomeres). Other mitotic clocks are described hereinbelow in the Examples. As used herein, “mitotic clock” means a change (e.g. increase) in the DNA hypomethylation level with each round of DNA replication.
As used herein “cell mass” means a mass or grouping of cells that originate from a parent cell.
Another aspect is a method for developing a mitotic clock, including (a) identifying a test cell for which a determination of a mitotic clock is desired; (b) providing conditions for the test cell to divide; (c) determining the number of effective cell divisions in the test cell at one or more timepoints; (d) using data processing apparatus to obtain CpG dinucleotide sequence methylation data for genomic DNA derived from the test cell at the timepoints, wherein the genomic DNA comprises highly methylated domains (HMD) and partially methylated domains (PMD), wherein each such CpG dinucleotide is the sole CpG dinucleotide sequence within a n(x)WCpGWn(x) genomic DNA sequence motif (Solo-WCGW motif) of at least one PMD, and wherein W=A or T, n=A or G or C or T, and x≥9; (e) using the data processing apparatus to determine, based on the CpG dinucleotide sequence methylation data, a mean or average CpG dinucleotide methylation value or a value related thereto at each of the timepoints for a plurality of Solo-WCGW motif sequences of the at least one PMD, to provide a measure of cellular replication-associated DNA methylation loss at each of the timepoints; (f) using the data processing apparatus to correlate the effective cell divisions at each of the timepoints with the measure of cellular replication-associated DNA methylation loss at each of the timepoints; and (g) if the correlation is statistically significant, identifying the measure of cellular replication-associated DNA methylation loss as a mitotic clock.
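Steps (e) through (g) of the method above can be sketched, in a non-limiting way, as follows. The text does not specify which correlation statistic or significance test is used, so this sketch substitutes a Pearson correlation with an exact two-sided permutation test; that choice, the function names, and the example values are all assumptions for illustration. The exact test enumerates every ordering, so the sketch is limited to a small number of timepoints, and it assumes the values are not all identical.

```python
from itertools import permutations
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def permutation_p_value(xs, ys):
    """Exact two-sided p-value: share of orderings of ys with |r| at least as large."""
    if len(xs) > 9:
        raise ValueError("exact enumeration is only practical for <= 9 timepoints")
    observed = abs(pearson(xs, ys))
    total = hits = 0
    for perm in permutations(ys):
        total += 1
        if abs(pearson(xs, perm)) >= observed - 1e-12:
            hits += 1
    return hits / total

def evaluate_clock(divisions, betas_by_timepoint, alpha=0.05):
    """Step (e): mean Solo-WCGW PMD methylation per timepoint;
    step (f): correlate with effective cell divisions;
    step (g): flag as a mitotic clock when the correlation is significant."""
    loss_measure = [mean(betas) for betas in betas_by_timepoint]
    r = pearson(divisions, loss_measure)
    p = permutation_p_value(divisions, loss_measure)
    return {"mean_methylation": loss_measure, "r": r, "p": p,
            "is_mitotic_clock": p < alpha}

# Usage with hypothetical cumulative divisions and beta values at six timepoints.
divisions = [0, 5, 10, 15, 20, 25]
betas = [[0.75, 0.80, 0.85], [0.71, 0.76, 0.81], [0.68, 0.73, 0.78],
         [0.64, 0.69, 0.74], [0.59, 0.64, 0.69], [0.55, 0.60, 0.65]]
print(evaluate_clock(divisions, betas))
```

With an exact permutation test, at least five timepoints are needed before a p-value below 0.05 is attainable, since a perfectly ordered series can do no better than 2 out of n! orderings.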
In some aspects, data processing apparatus is used to implement various aspects of the inventive method. For instance, the user may provide data input or selections to software being executed by the data processing apparatus. In some aspects of the present inventive methods, data processing apparatus is used because of the need for computing power to manipulate and analyze the large amount of data associated with measuring replication-associated DNA methylation loss. More specifically, it would not be humanly practical to digest and calculate replication-associated DNA methylation loss without errors. Using data processing apparatus, instead of a human, to perform repeated calculations, the calculations would be systematically accurate and reliable; an aspect of considerable importance to discerning cellular replicative/mitotic history, mitotic turnover rate, chronological age of a cell or tissue, increased risk for conditions associated with excessive replicative turnover or aging, identification of subjects for increased surveillance, cancer screening, forensic analysis, etc.
Implementations of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Moreover, subject matter described in this specification may be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter affecting a machine-readable propagated signal, or a combination of one or more of them. The terms “data processing apparatus”, “computing device” and “computing processor” encompass all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus.
A computer program (also known as an application, program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio player, a Global Positioning System (GPS) receiver, to name just a few. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touch screen for displaying information to the user and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
One or more aspects of the disclosure can be implemented in a computing system that includes a backend component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a frontend component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such backend, middleware, or frontend components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some implementations, a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device). Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.
The human and mouse Genome Assemblies GRCh37 and GRCm38 used for the present work are summarized below in Tables 2 and 3, respectively.
Exemplary, representative human and mouse n(x)WCpGWn(x) genomic DNA sequence motif species, wherein W=A or T, n=A or G or C or T, and wherein x=35 are provided below in Tables 4 and 5 (human) and Tables 6 and 7 (mouse).
Tables 8 and 9 list exemplary probes with extension base targeting CpG dinucleotide sequences in the respective exemplary human Solo-WCGW motif sequences listed in Tables 4 and 5, respectively.
Tables 10 and 11 list exemplary probes with extension base targeting CpG dinucleotide sequences in the respective exemplary mouse Solo-WCGW motif sequences listed in Tables 6 and 7, respectively.
Table 12 lists primary human cells obtained from multiple tissues and donors.
Table 13 lists 44 CpGs and coefficients selected by elastic net regression of solo-WCGW CpG beta values from serial primary cell culture to standardized population doubling level.
Table 14 is a summary of predictive performance of various methylation clocks on training dataset from primary cells.
Tables 15A-B list the CpGs in a 44-CpG model for predicting mitotic history within and between cell types.
Tables 16A-B list a subset of 75 strongly correlated CpGs for all tissue types studied.
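As a hedged sketch of how a table of CpGs and regression coefficients, such as the one described for Table 13, might be applied to a new sample, a fitted linear model can be evaluated as an intercept plus a coefficient-weighted sum of beta values. The probe identifiers, coefficient values, and intercept below are placeholders invented for illustration and are not the values of the disclosed tables; whether the disclosed model includes an intercept or any rescaling step is likewise an assumption.

```python
def predict_mitotic_score(betas: dict[str, float],
                          coefficients: dict[str, float],
                          intercept: float = 0.0) -> float:
    """Evaluate a linear model: intercept + sum(coefficient * beta) over model CpGs."""
    missing = [cpg for cpg in coefficients if cpg not in betas]
    if missing:
        raise KeyError(f"sample lacks beta values for: {missing}")
    return intercept + sum(coefficients[cpg] * betas[cpg] for cpg in coefficients)

# Hypothetical two-CpG model; identifiers and values are placeholders only.
example_coefficients = {"cpg_placeholder_1": -2.0, "cpg_placeholder_2": 4.0}
example_sample = {"cpg_placeholder_1": 0.5, "cpg_placeholder_2": 0.25, "other": 0.9}
print(predict_mitotic_score(example_sample, example_coefficients, intercept=1.0))
```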
TABLES 15A-B. 44-CpG model. The human reference sequence version is GRCh37 (hg19). Specific chromosome accession numbers can be found at https://www.ncbi.nlm.nih.gov/grc/human/data?asm=GRCh37.
TABLES 16A-B. 75-CpG Subset. The human reference sequence version is GRCh37 (hg19). Specific chromosome accession numbers can be found at https://www.ncbi.nlm.nih.gov/grc/human/data?asm=GRCh37.
WGBS means Whole-Genome Bisulfite Sequencing as recognized in the art (6).
“TCGA” as referred to herein, means The Cancer Genome Atlas (TCGA). TCGA is supervised by the National Cancer Institute's Center for Cancer Genomics and the National Human Genome Research Institute funded by the US government. A three-year pilot project, begun in 2006, focused on characterization of three types of human cancers: glioblastoma multiforme, lung, and ovarian cancer. In 2009, it expanded into phase II, which planned to complete the genomic characterization and sequence analysis of 20-25 different tumor types by 2014. TCGA surpassed that goal, characterizing 33 cancer types including 10 rare cancers.
“Hi-C-defined heterochromatic compartment B” as used herein is as recognized in the art, for example, by Fortin, J.-P. & Hansen, K. D. (7).
Disclosed are components that can be used to perform the disclosed methods and systems. These and other components are disclosed herein, and it is understood that when combinations, subsets, interactions, groups, etc. of these components are disclosed that while specific reference of each various individual and collective combinations and permutations of these may not be explicitly disclosed, each is specifically contemplated and described herein, for all methods and systems. This applies to all aspects of this application including, but not limited to, steps in disclosed methods. Thus, if there are a variety of additional steps that can be performed it is understood that each of these additional steps can be performed with any specific embodiment or combination of embodiments of the disclosed methods.
Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of this disclosure, suitable methods and materials are described below. The term “comprises” means “includes.” The abbreviation “e.g.” is derived from the Latin exempli gratia, and is used herein to indicate a non-limiting example. Thus, the abbreviation “e.g.” is synonymous with the term “for example.”
The present methods and systems may be understood more readily by reference to the following detailed description of preferred embodiments and the Examples included therein and to the Figures and their previous and following description.
(Solo-WCGW CpGs were Shown to be Prone to Hypomethylation)
This example describes definition and use of a Solo-WCGW sequence motif having substantial utility for measuring genomic DNA methylation loss.
TCGA tumors and adjacent normal samples were sequenced using paired-end WGBS at ˜15× sequence depth, to compile a set of 40 core tumor samples and 9 core normal samples (
A set of shared PMDs and HMDs was initially defined across the majority of our 49 core sample set using an existing Hidden Markov Model-based (HMM-based) method, MethPipe (27) (
Specifically,
Low CpG density within windows of about +/−35 bp was found to be optimal for predicting PMD-specific hypomethylation (
Specifically, FIGS. 10A1-A3 and B1-B2 show that the same sequence dependencies shown in
Specifically,
Subsequent analyses were focused on solo-WCGWs, representing 13% of all CpGs in the human genome. While they represent only the extreme of a hypomethylation process that affects other CpGs, focusing on solo-WCGWs alone enhanced the signal of PMD/HMD structure, especially in normal adjacent tissues and weakly hypomethylated tumors such as COAD-3518 (
Specifically,
In addition to enhancing the PMD/HMD signal in high coverage WGBS data, solo-WCGW CpGs allowed accurate PMD structure to be determined with average genomic read coverage as low as 0.05× in down-sampled bulk WGBS data (
Specifically,
(Most PMDs were Shown to be Shared Across Cancer and Normal Tissues)
Genomic plots of solo-WCGW CpG mean methylation revealed strong concordance between PMD locations in all samples in the core set (
Given the high variability of solo-WCGW PMD hypomethylation across samples (
Specifically,
Specifically,
Specifically,
(Most PMDs were Shown to be Shared Across Developmental Lineages)
Solo-WCGW PMD structure was also investigated by combining our TCGA dataset with 343 previously published human and 206 mouse WGBS samples (
In agreement with the TCGA tumor-adjacent “normal”, most disease-free post-natal tissues showed PMD structure shared with tumors and other groups (FIG. 3A1-3A2, Group PN and
All nucleated blood cell types showed shared PMD structure, in contrast to an earlier analysis of many of the same WGBS datasets (41) that found PMD hypomethylation to be limited to the lymphoid lineage (FIG. 3A1-3A2, Group PB). Both B cells and T cells could generally be divided into subgroups of strong vs. weak hypomethylation. Those subtypes having undergone antigen presentation and activation (e.g., memory B/T cells, regulatory T cells, germinal center B cells, and plasma cells) fell into the strongly hypomethylated class, while naive B and T cells fell into the weakly hypomethylated class, consistent with earlier reports showing that B and T cell hypomethylation increased during maturation (23, 24). However, unlike these earlier reports, the presently disclosed solo-WCGW analysis showed that PMD hypomethylation was already clearly evident by the naïve stage (FIG. 3A1-3A2 and
Specifically,
The tumor group (TM) consisted of 50 solid tumors (largely made up of the 40 core tumors shown previously), plus 50 hematopoietic malignancies (FIG. 3A1-3A2, Group TM). Interestingly, while hematopoietic tumors had more strongly hypomethylated PMDs than normal hematopoietic samples, they generally followed the trend established by their developmental origin: those derived from myeloid cells (AML) had shallower PMDs than those derived from lymphoid cells (CLL, MCL, TPLL, MM) (one-way Wilcoxon test, p=9.69e-7). The notable exception among lymphoid-derived tumors was ALL, which had hypomethylation levels similar to normal lymphoid cells. The lower degree of hypomethylation in ALL (derived from childhood cases) may reflect the generally lower degree of hypomethylation in cells from younger individuals, a topic investigated below.
For five of the six cell type groups (excluding group “GE”), mean methylation across samples in the group (
Specifically,
Specifically,
(PMD Hypomethylation was Shown to Emerge During Embryonic Development)
The presence of PMD hypomethylation in multiple fetal tissue types led to further investigation of solo-WCGW methylation in gametes and early developmental stages (
Embryonic somatic tissues (
Specifically,
Specifically,
(PMD Hypomethylation was Shown to be Associated with Chronological Age)
To investigate the link between PMD-associated hypomethylation and cumulative numbers of cell divisions, solo-WCGW methylation level within common PMDs was tested for association with donor age in different primary cell types. A strong age association was evident from the WGBS profile of sorted CD4+ T cells from a newborn vs. those from a 103-year-old individual, with the latter being closer to a T cell-derived leukemia than to the newborn sample (
Specifically,
An earlier study used the HM450 platform to investigate the effects of environmental (UV) exposure on PMD hypomethylation in human skin samples (26). While the earlier study described PMD hypomethylation as only occurring within the sun-exposed samples of the epidermal layer, the presently disclosed re-analysis of solo-WCGWs revealed that both dermal and epidermal cells exhibited age-associated PMD hypomethylation without sun exposure, but that this process was dramatically accelerated specifically in epidermal cells upon sun exposure (
HM450 datasets showed that diverse hematopoietic cell types had a significant association between donor age and degree of hypomethylation, with the myeloid lineage (
Specifically,
(PMD Hypomethylation was Shown to be Linked to Mitotic Cell Division in Cancer)
The landscape of cancer hypomethylation in 9,072 tumors from 33 cancer types included in TCGA was next studied using the HM450 solo-WCGWs located within common PMDs (
Specifically,
Specifically,
Specifically,
Somatic mutation events are known to display mitotic clock-like properties (38). Within TCGA tumors, higher genome-wide somatic mutation densities were found to be significantly associated with deeper PMD hypomethylation, suggesting that mitotic turnover may underlie both somatic mutation and PMD hypomethylation (
PMD hypomethylation was also associated with somatic copy number aberration density (
Specifically,
According to particular aspects of the present invention, tumors highly proliferative at the time of specimen collection may also reflect an extensive history of past cell division. Using TCGA samples with matched gene expression data, the 60 genes most strongly associated with PMD hypomethylation were identified, and it was determined that these genes were most enriched in Gene Ontology functional terms associated with proliferation and mitotic cell division (
Specifically,
(Both Replication Timing and H3K36Me3 were Shown to Affect Methylation)
The one cell type with publicly available data for all relevant histone and topological marks, IMR90, was used to systematically analyze the presently disclosed solo-WCGW based PMD definition. This analysis confirmed previous findings (6, 7) that HMD/PMD structure coincided with nuclear architecture, as characterized by Hi-C A/B compartments, Lamin B1 distribution and replication timing (
Specifically,
The de novo methyltransferase DNMT3B has recently been shown to be guided to transcribed gene bodies via a direct interaction with the H3K36 methylation mark (45). Active genes marked by H3K36me3 are overwhelmingly located in early replicating regions, and it has been suggested that both active transcription of gene bodies and early replication timing contribute to differential methylation throughout the genome (9). To disentangle the contributions of H3K36me3 and replication timing to genome-wide DNA methylation levels and PMDs, a stratified analysis of all solo-WCGW CpGs in the genome (
Specifically,
(Materials and Methods)
Whole Genome Bisulfite Sequencing.
Cases for the WGBS assay were selected from 8 of the most common cancer types (Lung squamous cell carcinoma, Lung adenocarcinoma, Breast, Colorectal, Endometrial, Stomach, Bladder, Glioblastoma). For at least one tumor from each cancer type, we also sequenced its adjacent histologically normal tissue; for the rest, only the tumor was profiled. These samples were combined with one tumor and matched normal colon cancer pair from an earlier study (6), yielding a core set of 40 well characterized tumors and 9 adjacent normal samples (
External Data.
The external human WGBS data consists of 19 germ cells and pre-implantation embryonic tissues, 13 post-implantation embryonic and fetal tissues, 37 cell lines, 59 non-blood normal primary tissues (including normal adjacent tissues of tumors as well as disease-free samples), 154 blood or blood component samples, 11 solid tumors and 50 blood malignancies (
Alignment and Extraction of Methyl-Cytosine Levels.
Reads were aligned to the genome (build GRCh37) using BSmap (71) under the following parameters “−p 27 −s 16 −v 10 −q 2
(3′-end adapter SEQ ID NOS:237 and 238, respectively). Duplicated reads were marked using Picard tools (see URLs, version 1.38). DNA methylation rates and SNP information were called using Bis-SNP (72), using the default easy-run procedure (see URLs). Bis-SNP allows for distinguishing a C->T mutation from bisulfite conversion by investigating the complementary strand. CpGs with fewer than 10 reads' coverage were excluded from analysis.
Genomic Binning.
To show megabase-scale HMD/PMD structures, a 100-kb window size was chosen so that the segments would contain a sufficient number of solo-WCGWs to give reliable methylation averages (
Specifically,
Definition of Preliminary PMD/HMD Domains Based on all CpGs.
WGBS was used at ˜15× coverage to profile methylation patterns of 40 tumors (39 new TCGA samples and one from a prior study (6)) from 8 of the most common cancer types, and tumors were selected on the basis of high cancer cell content (
Final Definition of PMDs/HMDs Based on Standard Deviation of Solo-WCGW Methylation.
Each 100-kb bin was classified as PMD or HMD using a Gaussian mixture model (implemented in the R package mixtools) based on the cross-sample SD of beta values from our core tumor samples (N=40). The Gaussian mixture model assumes two subpopulations of 100-kb bins: those located in PMDs, with higher cross-sample SDs, and those located in HMDs, with lower cross-sample SDs. The final cross-sample SD threshold for separating PMDs from HMDs was determined to be 0.125. The more conservative sets of “common PMDs” and “common HMDs” were defined by the criteria SD>0.15 and SD<0.10, respectively. Overlap of PMD boundaries between two samples was measured as the percentage of 100-kb bins identified as PMD in both samples or as HMD in both samples. The mouse PMDs/HMDs were defined in the same way using 32 postnatal non-brain WGBS samples (
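The thresholding step described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the present work: it applies the stated cross-sample SD cutoffs (0.125, 0.15, and 0.10) directly rather than refitting the Gaussian mixture model, it assumes the sample standard deviation, and the function names and input layout are hypothetical.

```python
from statistics import stdev

# Cutoffs taken from the text; everything else here is illustrative.
PMD_HMD_THRESHOLD = 0.125
COMMON_PMD_MIN_SD = 0.15
COMMON_HMD_MAX_SD = 0.10


def classify_bin(betas):
    """Classify one 100-kb bin from its per-sample mean beta values.

    Returns (domain, common_label), where domain is "PMD" or "HMD" and
    common_label is "common PMD", "common HMD", or None.
    Assumption: sample standard deviation (n - 1 denominator); a bin with
    SD exactly equal to the threshold is assigned to HMD.
    """
    sd = stdev(betas)
    domain = "PMD" if sd > PMD_HMD_THRESHOLD else "HMD"
    if sd > COMMON_PMD_MIN_SD:
        common = "common PMD"
    elif sd < COMMON_HMD_MAX_SD:
        common = "common HMD"
    else:
        common = None
    return domain, common


def classify_bins(bin_betas):
    """Apply classify_bin to a mapping of bin ID -> per-sample beta values."""
    return {bin_id: classify_bin(b) for bin_id, b in bin_betas.items()}
```

Bins whose SD falls between 0.10 and 0.15 receive a PMD or HMD call but no "common" label, which mirrors the more conservative definition of the common sets.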
HM450 Analysis.
For TCGA HM450 data sets, raw IDATs were preprocessed by first applying background subtraction (73) and then linear dye-bias correction matching the signal intensities of the two detection channels. Probe signals with detection p-value<0.05, as well as probes overlapping common SNPs and putative repetitive elements which cause potential cross-hybridization were then masked (74). For external data sets where raw IDATs were unavailable, processed beta values downloaded from GEO were used. Based on WGBS analysis, HM450 probes were classified according to the number of neighboring CpGs and the tetranucleotide sequence context. Only probes targeting solo-WCGW CpGs are retained. Also removed were probes falling into annotated CpG Islands, or those unmethylated (beta<0.2) in at least 20 of the 749 matched normal tissue samples included in TCGA. This resulted in 6,214 probes in common PMDs and 9,040 probes in common HMDs. Four letter acronyms for cancer types were taken following the official TCGA nomenclature. The difference of methylation between the mean methylation of solo-WCGW probes located in common PMDs and those in common HMDs was used to measure the degree of PMD-associated DNA hypomethylation in each sample. This method avoids confounding in the case of cancer types derived from globally de-methylated cell types such as primordial germ cells (
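The per-sample hypomethylation measure described in this section can be illustrated with a short Python sketch. It is an assumption-laden example rather than the code used in the present work: the probe identifiers and input layout are hypothetical, masked probes are represented as None, and the sign convention (PMD mean minus HMD mean, so that more negative values indicate deeper PMD hypomethylation) is assumed because the text does not state the direction of the difference.

```python
from statistics import mean


def pmd_hmd_difference(betas, pmd_probes, hmd_probes):
    """Mean beta of common-PMD probes minus mean beta of common-HMD probes.

    betas: dict mapping probe ID -> beta value (None for masked probes).
    pmd_probes / hmd_probes: iterables of probe IDs (hypothetical inputs).
    """
    def probe_mean(probes):
        # Skip probes that are absent or masked in this sample.
        values = [betas[p] for p in probes if betas.get(p) is not None]
        if not values:
            raise ValueError("no usable probes in set")
        return mean(values)

    return probe_mean(pmd_probes) - probe_mean(hmd_probes)
```

Because the HMD mean serves as a within-sample baseline, a sample that is demethylated genome-wide yields a difference near zero rather than a spuriously deep PMD signal.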
Analysis of the IMR90 Epigenome.
Features are clustered using 1−|ρ| as distance, where ρ is Spearman's correlation coefficient. Centromeres are excluded from IMR90 analysis. IMR90 epigenome data was downloaded from the ENCODE project data center (accessions listed in
Rescaling Based on PMD Methylation.
The distribution of methylation values within common PMD 100-kb bins was calculated. For each sample, the top and bottom 20% of this distribution were trimmed, setting low values to 0 and high values to 1, and all values between the 20th and 80th percentiles were linearly rescaled to the range [0,1] (
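The rescaling described in this section can be sketched in Python as follows. This is an illustrative example only: the function name is hypothetical, and the percentile convention (linear interpolation, via the "inclusive" method of the standard library) is an assumption, since the text does not specify how the 20% and 80% cut points were computed.

```python
from statistics import quantiles


def rescale_pmd_methylation(values):
    """Rescale one sample's common-PMD bin methylation values to [0, 1].

    Values at or below the 20th percentile map to 0, values at or above
    the 80th percentile map to 1, and values in between are rescaled
    linearly. Percentiles use linear interpolation (an assumption).
    """
    cuts = quantiles(values, n=5, method="inclusive")  # 20/40/60/80%
    low, high = cuts[0], cuts[-1]
    if high == low:
        raise ValueError("20th and 80th percentiles are equal")
    return [min(1.0, max(0.0, (v - low) / (high - low))) for v in values]
```

Clamping to the 20th and 80th percentiles makes samples with very different overall depths of hypomethylation comparable by the shape of their PMD profile rather than by its absolute level.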
Stratified Analysis of Solo-WCGW CpGs in the Genome.
The Solo-WCGW CpGs were first classified (
Statistics.
Except for when described explicitly in the text, P-values for two-group comparison were calculated using one-tailed Wilcoxon's Rank Sum test. Correlation coefficients were computed with Spearman's method, with the exact P-values calculated in R using algorithm AS (89), otherwise via asymptotic t-approximation when exact computation was not feasible.
Data availability.
The WGBS data (incorporated by reference herein) is available in Genome Data Commons (GDC) under the TCGA project with IDs and file names shown in
Code availability.
Our customized work flow for preprocessing WGBS sequencing data is freely accessible (see under URLs below; incorporated by reference herein).
URLs.
Roadmap Epigenomics data is downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/roadmapepigenomics/. BLUEPRINT epigenome project data is downloaded from ftp://ftp.ebi.ac.uk/pub/databases/blueprint/. ENCODE project data is downloaded from www.encodeproject.org. The Bis-SNP easy run procedure is detailed at http://people.csail.mit.edu/dnaase/bissnp2011/stepByStep.html. The entire customized work flow ECWorkflows is hosted and freely available at https://github.com/uec/ECWorkflows. Picard tools was downloaded from http://broadinstitute.github.io/picard.
(PMD Hypomethylation in Immortalized Cell Lines was Demonstrated Using the Solo-WCGW Motif)
According to particular aspects, PMD hypomethylation was observed in almost all cultured cell lines except for ESCs, iPSCs and their derived cell lines (
Note that although both ESCs and the proliferative tumors were high in the expression of DNMT3s compared to other normal tissues of non-embryonic origin, the level of expression in ESCs was higher than the most proliferative tumors. For example, the expression of DNMT3B in H1 hESC was higher than other cancer cell lines and primary tissues assayed in the ENCODE project by over ten-fold (
Specifically,
(Improved Analysis of HMD/PMD Structure was Demonstrated Using the Solo-WCGW Motif)
The primary focus of the present disclosure has been on cell-type invariant PMDs, which were useful for investigating general properties of methylation loss over time. The 49% of the genome we identified as occurring within “Common PMDs” (using the SD>0.15 method) contains essentially all of the cell-type-invariant PMD regions that applicants identified previously (84). PMDs were defined in the present work by exploiting the inherent variance in PMD hypomethylation levels across large cohorts of samples, which was the only cross-sample feature bimodally distributed between HMDs and PMDs. Under this definition, for example, the core tumor group (containing only solid tumors) had almost the same degree of shared PMDs with blood malignancies (82%) as it did with other solid tumors not from the core set (85%) (
Specifically,
The present focus on common PMDs does not discount the importance of cell-type-specific PMDs. The work of applicant's group and others showed that about 25% of PMDs were cell-type specific (80, 81), and the present results here do not conflict with that. Others have established that cell-type specific cancer PMDs can be associated with gene expression differences, and distinguish different molecular subtypes of medulloblastoma and Atypical Teratoid/Rhabdoid tumors (81-83). Work from Fortin and Hansen showed that these cell-type-specific PMD differences corresponded to cell-type-specific topological domain and chromatin structure differences using Hi-C and DNase data from the same cell lines (84).
Deep PMD hypomethylation was observed in the methylome of T cells from a 103-year-old individual (
While the discovery of solo-WCGW CpGs is a significant advance, the ability to detect differential PMDs in normal cell types with low levels of methylation loss will remain a challenge. This is an important challenge to tackle, as it may allow the identification of PMD-associated cell-of-origin markers in cancer, which can be combined with mutational-signature-based cell-of-origin markers (85). PMD domain structure can also act as a useful proxy for 3D topological changes and other chromatin features in clinical disease samples where Hi-C or other direct mapping methods are not feasible due to the quantity or quality of intact chromatin available. PMDs also mark regions of gene silencing, and thus can help to infer the gene expression history of the cells being sampled. For instance, Hovestadt et al. showed that PMDs in medulloblastoma tumors reflected subtype-specific expression silencing in normal brain precursor cells (90).
(Stability of Rank-Based Correlation Between Methylomes was Demonstrated Using the Solo-WCGW Motif)
A rank-based analysis of 792 genomic 100 kb bins from chromosome 16 (
Specifically,
(Alternative Explanation of PMD Hypomethylation)
While the present analysis supports replication timing as the most strongly associated genomic determinant of PMD methylation loss, replication timing is in practice very tightly linked to the Hi-C compartment “B” and the nuclear lamina based on applicants' work and the work of others (90, 91, 92). While the re-methylation window model is mechanistically attractive, we cannot rule out an alternative nuclear localization model (
(Relevance of the PMD Sequence Signature to Somatic and Germline Mutational Landscape was Assessed)
To investigate any potential impact of the PMD sequence signature on introducing cytosine deamination mutations in the CpG dinucleotides, the relative proportion of somatic mutations that are within certain tetranucleotide sequence contexts and certain numbers of neighboring CpGs was studied. Somatic CpG to TpG mutations reported in an early gastric cancer whole-genome sequencing experiment were compared, and the comparison confirmed that solo-WCGWs within late replicating PMDs had a lower CpG to TpG mutation rate than other sequence contexts (
While only a limited number of samples were available for gametogenesis, dramatic PMD hypomethylation was observed in at least one germline cell type, the Germinal Vesicle, M-I Oocyte (
Specifically,
(Certain Specific Sub-Patterns that Match the Solo-WCGW Definition were Found to be More Predictive than the General Definition, and DNA Shape Features were Also Found to be Predictive)
Above, working Example 1 demonstrates that the Solo-WCGW motif is highly predictive of PMD methylation loss across a large number of cell types and across mammalian species. Formally, Solo-WCGW is defined as n(x)WCpGWn(x), where a series of x positions on either side can match any base n (A, C, T, or G) but none can match a CG dinucleotide. According to particular additional aspects of the present invention that we have demonstrated, much of the predictive value (for replication-associated methylation loss) is captured by this general pattern. However, this pattern represents a large number of actual sequence instances (using the preferred definition of x=34, there are approximately 3 million unique individual matching sequences in the human genome), and thus we investigated whether it is possible to define sub-patterns that may further improve the predictive value, and that can be used to prioritize sequences used in, for example, biomedical tests and other methods described herein. An exemplary covariance analysis was performed that supports the presence of such sub-patterns, as described below.
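The formal n(x)WCpGWn(x) definition can be made concrete with a short Python sketch. This is a minimal illustration, assuming a plain uppercase or lowercase DNA string as input; the function name is hypothetical, positions are reported as 0-based indices of the cytosine, and candidate sites too close to either end of the sequence to have a full flank of x bases are skipped (an assumption, since edge handling is not addressed in the text).

```python
def find_solo_wcgw(seq, x):
    """Return 0-based positions of the C in each solo-WCGW CpG.

    A site matches n(x)WCGWn(x): W is A or T on both sides of the CpG,
    and neither x-base flank contains a CG dinucleotide.
    """
    seq = seq.upper()
    hits = []
    # i indexes the leading W of the 4-base core W-C-G-W.
    for i in range(x, len(seq) - x - 3):
        core = seq[i:i + 4]
        if core[0] not in "AT" or core[1:3] != "CG" or core[3] not in "AT":
            continue
        left = seq[i - x:i]
        right = seq[i + 4:i + 4 + x]
        # Flanks must be unambiguous bases and free of additional CpGs.
        if not set(left + right) <= set("ACGT"):
            continue
        if "CG" in left or "CG" in right:
            continue
        hits.append(i + 1)
    return hits
```

For example, `find_solo_wcgw("AAAAA" + "ACGT" + "AAAAA", 5)` reports the single CpG at position 6, whereas placing a second CpG inside the left flank removes the site from the result.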
In the analysis, we started with the set of all Solo-CpGs (n(35)CpGn(35)) that fell within each common PMD as described above, and then compared the similarity of each Solo-CpG to all others within the common PMD using covariance across samples in our human WGBS set, described above. Hypomethylation prone Solo-CpGs were found to have high average covariance with other Solo-CpGs within the same PMD, and we defined those with average covariance greater than or equal to the 85th percentile of covariance for all Solo-CpGs in all common PMDs in the genome as “hypomethylation prone”. Those with covariances less than or equal to the 5th percentile of all values, with average methylation across all samples of >0.7, were defined as “hypomethylation resistant”. We then calculated the ratio of hypomethylation resistant to hypomethylation prone frequencies for all sextanucleotide Solo-CpG sequences (matching the pattern “NNCGNN”), and sorted sequences from those most resistant to those most prone, as shown in
In addition to DNA sequence patterns, DNA secondary structure or “DNA shape” is known in the art to play a role in the binding efficiency of chromatin modifying proteins, and may thus also be useful for defining sub-patterns of the Solo-WCGW pattern that can be used for prioritization of sequences to use, for example, in biomedical tests and other methods to improve the accuracy of replication-associated hypomethylation prediction. We have used the same hypomethylation resistant vs. hypomethylation prone analysis described in the last paragraph to investigate the association of DNA shape, using the tool DNAShapeR™ (102). By comparing DNA shape in the most hypomethylation resistant vs. most hypomethylation prone Solo-CpGs, we determined that one particular DNA shape, “propeller twist” was specifically low in the hypomethylation prone Solo-CpGs, as shown in
Specifically,
(Materials and Methods for Examples 16-18)
Primary Cell Culture.
Primary human cells obtained from multiple tissues and donors (n=5, Table 12), as facilitated by biobank Coriell, were serially cultured until replicative senescence. At each passaging, or replating, of cells, cell count and viability were measured to calculate population doubling level (PDL), the metric for observed mitotic history. DNA was extracted from cells at each timepoint (n=116).
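The PDL bookkeeping can be sketched in Python as follows. The text does not state the formula used, so this example assumes the conventional one, in which each passage adds log2(cells harvested / cells seeded) doublings to the running total; the function names and the use of viable cell counts as inputs are likewise assumptions.

```python
from math import log2


def population_doublings(cells_seeded, cells_harvested):
    """Doublings accrued in one passage (conventional formula, assumed)."""
    if cells_seeded <= 0 or cells_harvested <= 0:
        raise ValueError("cell counts must be positive")
    return log2(cells_harvested / cells_seeded)


def cumulative_pdl(starting_pdl, passages):
    """Running PDL after each passage.

    passages: iterable of (viable cells seeded, viable cells harvested).
    """
    pdl = starting_pdl
    history = []
    for seeded, harvested in passages:
        pdl += population_doublings(seeded, harvested)
        history.append(pdl)
    return history
```

Under these assumptions, a culture starting at PDL 10 that expands fourfold and then eightfold in two successive passages would be recorded at PDL 12 and PDL 15.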
DNA Methylation Assay.
Bisulfite-converted DNA was applied to an Illumina HumanMethylation EPIC microarray and fluorescence was measured aboard an Illumina iScan at probes sensitive to methylation status at >850,000 CpGs in the human genome. Other DNA methylation assays can be substituted for the EPIC array, such as other Illumina methylation arrays or whole genome bisulfite sequencing.
Beta Calling.
Using the sesame package (103) in statistical software R, raw fluorescence intensities were normalized to out-of-band fluorescence intensity (73) before beta value calculation. Beta value is the measure of degree of methylation at a given CpG dinucleotide; a beta value of 1 reflects complete methylation and 0 reflects complete unmethylation. Beta-calling of Illumina 450K and EPIC arrays is supported by sesame; other upstream methylation analyses will have different processing requirements.
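For orientation, the beta value can be written as the methylated signal divided by the total signal. The Python sketch below illustrates that standard definition only; it is not the sesame API, the function name is hypothetical, and the optional offset parameter (some pipelines add a small constant to the denominator) is included as an assumption rather than as a description of the processing used here.

```python
def beta_value(methylated, unmethylated, offset=0.0):
    """Beta = M / (M + U + offset); returns None when no signal is present.

    methylated / unmethylated: normalized signal intensities for one CpG.
    offset: optional stabilizing constant (0 by default; an assumption).
    """
    total = methylated + unmethylated + offset
    if total <= 0:
        return None
    return methylated / total
```

With this definition, a CpG with intensities 900 (methylated) and 100 (unmethylated) has a beta value of 0.9, and a CpG with no methylated signal has a beta value of 0.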
QA/NA Removal.
Specific samples and probes which exhibited consistently poor performance, as determined by NA/missing values returned on >5% of CpGs or samples, respectively, were removed. For the test set shown hereafter, NA probe filtering was applied at full stringency to ensure a maximally reproducible probe set: probes with ≥1 NA (n=279,797) were removed, although other applications may allow more relaxed filtering.
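The filtering rules above can be sketched in Python as follows. This is a hedged illustration, not the code used in the present work: the nested-dictionary input layout and function name are hypothetical, missing values are represented as None, and the order of operations (samples filtered first, then probes evaluated on the remaining samples) is an assumption.

```python
def filter_samples_and_probes(betas, max_na_fraction=0.05, strict_probes=False):
    """Drop poorly performing samples, then poorly performing probes.

    betas: dict probe ID -> dict sample ID -> beta value or None (NA).
    Samples with NA on more than max_na_fraction of probes are removed.
    Probes with NA on more than max_na_fraction of the kept samples are
    removed; with strict_probes=True, any probe with one or more NA is removed.
    """
    probes = list(betas)
    samples = list(betas[probes[0]]) if probes else []

    def na_fraction(values):
        values = list(values)
        if not values:
            return 0.0
        return sum(v is None for v in values) / len(values)

    kept_samples = [
        s for s in samples
        if na_fraction(betas[p][s] for p in probes) <= max_na_fraction
    ]
    probe_limit = 0.0 if strict_probes else max_na_fraction
    kept_probes = [
        p for p in probes
        if na_fraction(betas[p][s] for s in kept_samples) <= probe_limit
    ]
    return {p: {s: betas[p][s] for s in kept_samples} for p in kept_probes}
```

Setting `strict_probes=True` corresponds to the complete NA removal used for the test set, while the default corresponds to the more relaxed 5% rule.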
Solo-WCGW Subsetting.
Following sample and probe removal, probes were filtered to include only solo-WCGW CpGs in common PMDs (n=26,732 on EPIC microarray, n=9,711 following complete NA removal). Solo-WCGW identity is based on profiling of human genome build 19 (hg19); a full manifest is available at http://zwdzwd.io/pmd/soloWCGW_inCommonPMDshg19.bed.gz. Sequence positions may differ slightly by genome build.
(Elastic Net Modeling Strategy)
PDL Standardization.
Elastic net regression (ENR) was applied via the glmnet package in R across individual donor cultures, regressing against observed PDL in culture. Glmnet settings were mostly default; alpha was set to 0.5 (to achieve ENR) with gaussian distribution. A linear model was automatically selected. The mitotically youngest donor culture was AG21839, a neonatal foreskin fibroblast cell line. To standardize PDL and allow for development of a multi-tissue mitotic clock, starting PDLs from all other cell lines were normalized to the ENR model built from AG21839 (Table 12, ‘Standardized PDL’). Delta PDL was added to adjusted starting PDL for the following timepoints.
Multi-Tissue ENR Modeling.
Using prefiltered beta values from all cultures with standardized PDL, ENR was again performed using the same settings as above.
10-Fold Cross Validation and Probe Reduction.
To select the number of CpGs allowed in the model and control for potential overfitting, 10-fold cross validation was performed on the model. Lambda was set at lambda minimum+1 standard deviation, resulting in 44 CpGs included in this model (Table 13).
Model Performance.
A heatmap of beta values at the selected CpGs across advancing PDL shows consistent hypomethylation across donors, cell types, and subcultures (
Suggested Use:
The elastic net regression strategy produced a robust 44-CpG model for predicting mitotic history within and between cell types (Tables 15A-B).
(Individual Probe Regression Strategy)
Simple linear regression was applied individually to each prefiltered probe.
Regression coefficients r and r² from all primary cell cultures were compared.
Density plots of regression coefficients r and r² (
Model Performance:
A heatmap of the selected CpGs across advancing PDL shows consistent hypomethylation across donors, cell types, and subcultures (
Suggested Use:
The individual probe regression strategy, yielding a subset of 75 (Tables 16A-B) strongly correlated probes for all tissue types studied, offers an immediate refinement of the solo-WCGW signature. When beta values of these CpGs are weighted equally, robust intra-cell-type mitotic history comparisons are possible.
(Elastic Net Model Versus Individual Regression Model)
While both are highly predictive, the probe landscapes of the two mitotic clocks are rather distinct. There are only two overlapping CpGs between the sets, cg15328937 and cg23127532; both are negatively correlated in both models. Nine and 35 CpGs of the elastic net model are positively and negatively correlated with mitotic age, respectively. Regression coefficients for the elastic net model range from −19.24 to 15.52; the intercept is 83.01. For the individual regression model, all CpGs are equally weighted by taking the mean, but each cell type has a different intercept, ranging from 0.500 for AG16146 to 0.738 for AG11546, and slope, ranging from −0.005 for AG21839 to −0.011 for AG16146. Whereas the elastic net model places multi-tissue-type mitotic history on the same scale, the individual regression model's cell-specific slope/intercept values likely reflect slight differences in rates of solo-WCGW hypomethylation across tissue type and age.
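The difference between the two scoring schemes can be illustrated with a brief Python sketch. It assumes that the elastic net model is applied as an intercept plus a coefficient-weighted sum of beta values and that the individual regression model is summarized as an unweighted mean, as described above; the function names are hypothetical, and any coefficients supplied in an example call are placeholders rather than the values of Table 13.

```python
from statistics import mean


def elastic_net_pdl(betas, coefficients, intercept):
    """Predicted standardized PDL: intercept + sum(coefficient * beta).

    betas: dict CpG ID -> beta value for one sample.
    coefficients: dict CpG ID -> model coefficient (e.g., from Table 13).
    """
    missing = [cpg for cpg in coefficients if cpg not in betas]
    if missing:
        raise KeyError(f"missing beta values for: {missing}")
    return intercept + sum(coefficients[c] * betas[c] for c in coefficients)


def equal_weight_score(betas, cpgs):
    """Unweighted mean beta across a CpG set (e.g., the 75-CpG subset)."""
    return mean(betas[c] for c in cpgs)
```

The weighted score is expressed in PDL units and is therefore comparable across tissues, whereas the equal-weight score is a mean beta value that is best compared within a single cell type, consistent with the cell-specific slopes and intercepts noted above.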
(Comparison to Existing Clocks)
Comparison to Hannum Clock.
Hannum pioneered the modern methylation clock with a 71-CpG model (58) that predicts chronological age with high accuracy (>90% accuracy with mean error of several years) in whole blood samples in adults. In addition to introducing a high-performing methylation clock, Hannum et al. produced it by implementing elastic net regression (104) via the glmnet package (105) in statistical software R. Elastic net regression (ENR) combines Lasso and ridge regression techniques to reduce both the number of variables and the relative contribution of each variable to a multivariate model, in which the number of potential variables vastly outnumbers the observations. It has since proven to be adept at modeling methylation clocks while controlling for overfitting. Hannum's clock performs poorly in non-blood samples and in blood samples from children, which limits its adoption; the composition of white blood cells and the resulting methylation patterns change dramatically during development. Three of the 71 CpGs are solo-WCGWs; none of these are present in the solo-WCGW clock. A heatmap of beta values at Hannum CpGs is shown in
Comparison to DNAm Age.
The most widely applied methylation clock, ‘DNAm Age,’ (59) predicts chronological age with high accuracy in most human tissues. Elastic net regression was applied across a large dataset of Illumina Infinium HumanMethylation 27K and 450K BeadChip array data from apparently healthy human tissues of different chronological ages to mathematically select 353 CpGs and individual coefficients for each CpG. The weighted average of coefficient-multiplied beta values at these CpGs estimates chronological age with high accuracy across most tissues. Of the 353 CpGs, 193 are positively and 160 are negatively correlated with chronological age. DNAm Age was developed to perform well on multiple tissues with extremely variable mitotic capacities (e.g. brain and liver), so it is unsurprising that there is no overlap between it and the solo-WCGW clocks; however, three of the 353 CpGs are solo-WCGWs in common PMDs. A heatmap of beta values at DNAm Age CpGs is shown in
Comparison to Skin & Blood Clock.
Despite high performance across most tissues, DNAm Age underperformed on skin and blood samples. Because skin and blood are amongst the easiest tissues to collect for clinical and forensic applications, this limited the application of DNAm Age. To remedy this, Horvath developed a similar ‘Skin & Blood Clock’ (106), which shares 60 CpGs (of 391) with DNAm Age. Six of these CpGs are solo-WCGWs, although there is no overlap of these probes with the three solo-WCGWs in DNAm Age. Again, there is no probe overlap between the solo-WCGW clocks and the Skin & Blood Clock. A heatmap of beta values at Skin & Blood Clock CpGs is shown in
Comparison to DNAm PhenoAge.
The ‘DNAm PhenoAge’ methylation clock (107) was trained not to predict chronological age of tissues but to predict all-cause mortality, or ‘phenotypic age,’ as defined by a panel of biomarkers. Using the same mathematical parameters as Horvath's chronological methylation clocks, ENR produced 513 CpGs, of which 57 overlap with DNAm Age and 41 overlap with the Skin & Blood Clock (20 are shared by all 3 models, albeit with differing weights). Four of these CpGs are solo-WCGWs; however, none of these are probes within the solo-WCGW clocks. A heatmap of beta values at PhenoAge CpGs is shown in
Comparison to the epiTOC Mitotic-Like Methylation Clock.
More comparable to the solo-WCGW clock in developmental strategy and in application is the ‘epiTOC’ mitotic-like methylation clock (108). Whereas DNAm Age, the Skin & Blood Clock, and DNAm PhenoAge applied no biological prefiltering of probes in their construction, relying solely on glmnet-powered ENR and 10-fold cross validation to select probes and coefficients, Yang et al. prefiltered CpGs based on the observation that polycomb target CpGs gain methylation with advancing age in a seemingly mitotic-capacity-driven manner. PRC2 polycomb target CpGs (109) were subsetted from the large whole blood dataset of Hannum, and only CpGs that were unmethylated in fetal tissues and gained methylation with advancing chronological age in the training set were considered for the model; 385 CpGs remained. The epiTOC model was not built on ENR but takes the untransformed mean of the beta values at these 385 CpGs to estimate relative mitotic age. This model was trained solely on whole blood samples, yet its authors have applied it to multiple tissues. None of the 385 epiTOC CpGs are present in DNAm Age, the Skin & Blood Clock, DNAm PhenoAge, or the solo-WCGW clocks. Indeed, none of the epiTOC probes are solo-WCGWs; this is likely a product of preselecting only PRC2-target CpGs. A heatmap of beta values at epiTOC CpGs is shown in
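By way of non-limiting illustration, a score formed as the untransformed mean of beta values over a predefined probe set can be sketched as follows. The function name, the probe identifiers, and the `min_fraction` coverage threshold are assumptions introduced for the example and are not part of any published model.

```python
from statistics import fmean
from typing import Iterable, Mapping, Optional


def mean_beta_score(betas: Mapping[str, Optional[float]],
                    probe_ids: Iterable[str],
                    min_fraction: float = 0.8) -> float:
    """Unweighted mean of beta values over a fixed probe set.

    Probes that are absent or None (e.g., failed detection) are skipped.
    min_fraction is a hypothetical guard: if too few probes were measured,
    the mean is not considered representative and an error is raised.
    """
    probes = list(probe_ids)
    if not probes:
        raise ValueError("probe set is empty")
    observed = [betas[p] for p in probes if betas.get(p) is not None]
    if len(observed) < min_fraction * len(probes):
        raise ValueError("too few probes measured for a reliable score")
    return fmean(observed)


# Hypothetical probe set and sample; one probe failed and is skipped.
toy_probe_set = ["p1", "p2", "p3", "p4", "p5"]
toy_sample = {"p1": 0.10, "p2": 0.20, "p3": 0.30, "p4": 0.40, "p5": None}
toy_score = mean_beta_score(toy_sample, toy_probe_set)
```

Because no coefficients are fitted, the behavior of such a score depends entirely on how the probe set was chosen.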
The solo-WCGW mitotic clock of the present invention is the first model to estimate mitotic age with high accuracy in primary cell culture (Table 3). Relative mitotic age estimation and comparisons between same-tissue samples can be performed with either the elastic net model or the independent regression model. Cross-tissue mitotic age comparisons (e.g., directly comparing skin tissue to vascular smooth muscle tissue) and absolute mitotic history can be estimated with the elastic net model but not with the independent regression model. The construction of the solo-WCGW clock is unique in that it is the first of its kind to be trained from serial cell culture data. This feature gives the clock increased sensitivity, down to individual population doublings, over other methylation clocks which estimate age in years (with mixed success on cell culture data, see
According to additional aspects, therefore, more specific definitions within the general Solo-WCGW pattern are provided for prioritization of sequences used in biomedical tests and other methods disclosed herein to track replication-associated DNA methylation loss.
(Additional Exemplary Methods)
Particular aspects of the present invention provide, but are not limited to, the following exemplary methods:
A method for determining chronological age, or accelerated chronological age of a cell or tissue sample of a test subject, comprising:
collecting cell and tissue samples, and sorting cells if necessary;
extracting DNA;
performing bisulfite conversion and library preparation (e.g., DNA sonication, PCR amplification);
measuring beta*values (e.g., using 1000 probes with the extension base targeting solo-WCGW CpGs);
computing a score by taking the average of these solo-WCGW CpG beta values;
using the score as an indication of mitotic age;
computing a calibration curve by looking at the mitotic age score computed above in a population in a range of chronological ages; and
for test individuals, interpolating the chronological age to compare the standard mitotic age with the test mitotic age to determine if there is accelerated aging.
(*The Beta-value is the ratio of the methylated probe intensity to the overall intensity (the sum of the methylated and unmethylated probe intensities); see, e.g., Du, Pan, et al., BMC Bioinformatics 2010; 11:587; doi 10.1186/1471-2105-11-587, incorporated by reference herein.)
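By way of non-limiting illustration, the beta-value definition and the calibration and interpolation steps of the above method can be sketched as follows. The straight-line calibration, the function names, the optional `offset` term, and all numerical values are assumptions chosen for the example; the method itself does not prescribe a particular curve shape.

```python
from typing import Sequence, Tuple


def beta_value(methylated: float, unmethylated: float, offset: float = 0.0) -> float:
    """Methylated intensity divided by total intensity.

    offset is an optional stabilizing constant (an assumption; 0 gives the
    plain ratio defined in the text).
    """
    total = methylated + unmethylated + offset
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return methylated / total


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("calibration ages must not all be equal")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def apparent_age(score: float, slope: float, intercept: float) -> float:
    """Invert the calibration line to read an age off a measured score."""
    if slope == 0:
        raise ValueError("calibration line is flat; age cannot be interpolated")
    return (score - intercept) / slope


# Hypothetical reference population: mean solo-WCGW beta value versus age.
calibration_ages = [20.0, 40.0, 60.0, 80.0]
calibration_scores = [0.80, 0.75, 0.70, 0.65]
slope, intercept = fit_line(calibration_ages, calibration_scores)

# Hypothetical test individual: chronological age 50, measured score 0.70.
acceleration_years = apparent_age(0.70, slope, intercept) - 50.0
```

In this toy example a positive `acceleration_years` means the sample carries the methylation loss expected of an older reference individual.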
A method for determining the mitotic turnover history of a cell, comprising:
collecting/immortalizing a primary cell line (e.g., lymphoblastoid cell line or other tissues);
passaging the cell line to certain passage numbers;
extracting DNA from the cells at each passage number, and performing bisulfite conversion and library preparation;
calibrating the passage number against solo-WCGW beta value averages (e.g., using 1000 probes with the extension base targeting solo-WCGW CpGs); and
for test samples, interpolating the passage number using the measured solo-WCGW value averages.
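By way of non-limiting illustration, the interpolation of a passage number from a measured solo-WCGW average can be sketched with piecewise-linear interpolation between calibration points. The function name and the calibration values are hypothetical, and the sketch assumes that the averaged beta value changes monotonically with passage number over the calibrated range.

```python
from typing import Sequence, Tuple


def interpolate_passage(score: float,
                        calibration: Sequence[Tuple[float, float]]) -> float:
    """Estimate passage number from a mean beta value.

    calibration holds (passage_number, mean_beta) pairs. The score is located
    between two neighboring calibration points and the passage number is
    interpolated linearly within that segment.
    """
    points = sorted(calibration)  # order by passage number
    for (p0, s0), (p1, s1) in zip(points, points[1:]):
        low, high = min(s0, s1), max(s0, s1)
        if s0 != s1 and low <= score <= high:
            return p0 + (score - s0) * (p1 - p0) / (s1 - s0)
    # No extrapolation: values outside the calibrated range are rejected.
    raise ValueError("score lies outside the calibrated range")


# Hypothetical calibration series from one serially passaged cell line.
toy_calibration = [(5, 0.80), (10, 0.76), (20, 0.70), (30, 0.66)]
toy_passage = interpolate_passage(0.73, toy_calibration)
```

Refusing to extrapolate is a deliberate choice in this sketch, since the text describes calibration only at the passage numbers actually sampled.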
A method of measuring excessive replicative turnover history in cancer by comparing to matched normal cell-type of origin, comprising:
collecting, for each tumor, a normal cell type of origin;
deriving a passage number calibration curve using the method above;
interpolating the passage number of the tumor cells; and
comparing the passage number of the tumors with the normal.
A method for measuring increased risk of a subject for conditions associated with excessive replicative turnover or aging (e.g., cancer, neurodegenerative disease, cardiovascular disease, progeria, etc.), comprising:
collecting relevant tissues/cell types from affected individuals and disease-free controls;
measuring the passage number using the method described above, wherein the passage number is associated with the disease onset and age; and
calibrating the risk for the corresponding disease using the determined passage number of the relevant cells.
A method for identifying subjects for increased surveillance and screening, comprising:
collecting cell-free circulating DNA from patients or test individuals and disease-free controls;
performing bisulfite conversion and library preparation;
computing a mitotic replicative score by averaging the solo-WCGW CpG beta values (e.g., using 1000 probes with the extension base targeting solo-WCGW CpGs); and
identifying subjects in need of increased surveillance and screening if their mitotic replicative score is significantly higher than disease-free controls.
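By way of non-limiting illustration, the comparison against disease-free controls can be sketched as a z-score test. The method does not specify a statistical test; the z-score, the default threshold of 1.645 (a one-sided 5% level under a normality assumption), the function name, and the sample values are all assumptions. The direction of the comparison follows the wording of the step above.

```python
from statistics import mean, stdev
from typing import Dict, Mapping, Sequence


def flag_for_surveillance(test_scores: Mapping[str, float],
                          control_scores: Sequence[float],
                          z_threshold: float = 1.645) -> Dict[str, bool]:
    """Flag subjects whose score exceeds the control mean by z_threshold SDs."""
    if len(control_scores) < 2:
        raise ValueError("at least two controls are needed to estimate spread")
    control_mean = mean(control_scores)
    control_sd = stdev(control_scores)  # sample standard deviation
    if control_sd == 0:
        raise ValueError("controls show no variation; z-score is undefined")
    return {
        subject: (score - control_mean) / control_sd > z_threshold
        for subject, score in test_scores.items()
    }


# Hypothetical mitotic replicative scores.
toy_controls = [0.50, 0.52, 0.48, 0.51, 0.49]
toy_flags = flag_for_surveillance({"subject_1": 0.51, "subject_2": 0.56}, toy_controls)
```

With larger control cohorts, a tissue-matched and age-matched reference distribution could be substituted for the single pooled control group used here.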
A method for forensic analysis, comprising:
collecting tissue from the crime scene;
extracting DNA and performing bisulfite conversion;
measuring solo-WCGW CpG methylation average in the extracted DNA (e.g., using 1000 probes with the extension base targeting solo-WCGW CpGs); and
computing a chronological age for a matched cell type using the method outlined above.
The references cited above are incorporated herein by reference for their respective teachings.
This invention was made with government support under Grant Nos. U24 CA210969, U01 CA184826, and U24 CA143882, awarded by the National Institutes of Health, and R01 CA170550 and R01 HG006705, awarded by the National Institutes of Health/National Cancer Institute. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB2019/051689 | 3/2/2019 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62637979 | Mar 2018 | US |