The disclosure generally relates to a deep learning framework utilizing spectral clustering and discrete cosine transform of non-native data to predict the age of a subject using methylation data derived from cells, and to identify conditions of a subject associated with aging. The disclosure also relates to treating conditions associated with aging, including Herbalmax® Reinvigorator™ Gen2 nicotinamide mononucleotide (NMN) dosing and administration, consistent with and/or based on the prediction.
Traditional deep learning methods typically use approximately 353 CpG island measurements as input to a model containing 4×95-sized layers and 73 epochs of training. These methods are less accurate when predictions are based on non-native data sets. Typical deep learning may predict each individual in a 100-sized test set separately where r=0.58 and total error 2200 years when applied on non-native data. As such, prior art methods of filtering data sometimes produce performance gains that may stem from artifacts, resulting in inaccurate results.
Age prediction of humans (using methylation data) has, in the prior art, concentrated largely on the use of regression models. This has been done by predicting age as a single number, measured in years (PMID: 25968125). However, regression-based age predictions show only limited success when it comes to attributing complex and differential ageing to largely unknown biological systems. The difference between predicted and actual age has often been used as a strategy to address complex ageing phenotypes that are difficult to characterize. In addition, attempts have been made to simultaneously and randomly alter the ageing of many genes at a time.
Prediction accuracy of prior art methods is estimated to be within approximately 3 years on native data, where the discrepancy is largely attributed to each individual's biologically unique senescence process, and to a lesser extent, prediction error. While some methods make predictions using only a few CpG islands, prior art methods have included using regression models, and more recently deep learning methods applied to non-native data, such as using Tensorflow and the Keras package in a Python environment, including the use of Variational Auto Encoders (VAEs), as in the MethylNet method (PMID: 32183722). Existing methods of filtering data may result in performance gains, but those gains could stem from artifacts in the sample.
The present invention solves the problem presented when a deep learning model is trained on one type of data, for example from blood, but is used to make predictions on another type of data, for example saliva. Results using prior art methods may not be accurate, since those methods may make predictions as having the same age rather than creating a good estimate of the true age.
Variational Auto Encoders have been explored extensively as a solution to this problem, but we have found that a more successful approach is to use discrete cosine transform (DCT) coefficients to model a curve of the age predictions of a group of, for example, 100 individuals. To make the curve appear more simplified, but without knowing the true age, the corresponding methylation data for the 100 individuals is clustered using spectral clustering.
After applying spectral clustering, the corresponding age graph will take on a more regular shape that can more successfully be approximated with fewer DCT coefficients. For the methylation data of each individual, we consider 353 CpG islands commonly used in the literature. For the 100 individuals, each CpG island can be represented as a curve going across the 100 individuals, and each such curve can be approximated by a limited number of DCT coefficients, resulting in a table of 353 rows of DCT coefficients, and a single row of the age annotation DCT coefficients (
Provided herein are methods for detecting differences in DNA methylation in a subject utilizing spectral clustering and discrete cosine transform on non-native data, namely data different from the training data. In some aspects, the method includes obtaining a DNA sample from a subject, processing the DNA sample, detecting the CpG short tandem nucleic acid sequence in the DNA sample, comparing the CpG short tandem nucleic acid sequence and methylation level thereof, to a non-native data set using a spectral clustering algorithm (SCA) and discrete cosine transform (DCT), and providing results to the subject. In some embodiments, the DNA sample is derived from human cells, preferably plasma or urine free DNA.
In some embodiments, comparing the CpG short tandem nucleic acid sequence in the DNA sample further comprises comparing the degree of methylation in a DNA sample to a non-native data set generated by SCA and DCT. In some embodiments, the method further includes identifying the abnormal state associated with differentially methylated CpG islands.
In a preferred embodiment of the present invention, a training or test data point is viewed as a unique extraction of 100 individuals, rather than that of a single individual. Each extraction's age array of 100 numbers is represented as a set of DCT coefficients. This data may be used to accurately model 100 ages using 50 coefficients or less. The effect is improved if the data is clustered beforehand. There are several methods that can be used to cluster the data, but in a preferred embodiment, the data is clustered using a spectral clustering algorithm.
Also provided herein are methods for detecting a condition associated with aging. In some aspects, the method includes treating a condition associated with aging. In some embodiments, the method includes obtaining a DNA sample from a subject, detecting the methylation of at least one of the nucleic acid sequences, or a combination thereof, comparing the DNA sample to a known population, and providing treatment, including NMN administration, to the subject based on the condition associated with aging.
The contents of this section are intended as a simplified introduction to the disclosure, and are not intended to limit the scope of any claim.
In one aspect, the disclosure provides a method for detecting differences in DNA methylation between a DNA sample and a non-native data set. In some embodiments, the method comprises obtaining a DNA sample from a subject, processing the DNA sample, comparing the DNA sample to a non-native data set using spectral clustering and discrete cosine transform, and providing results to the subject.
Discrete Cosine Transform (DCT) is a technique used to convert a signal or image into elementary frequency components. An input signal is represented as a linear combination of weighted basis functions that are related to its frequency components. It is a means of decomposing a signal into frequency dependent components. Low frequency components of the signal are assigned a low weight as related to high frequency components. Low weight frequencies are discarded and the remaining components are used to reconstruct the signal. The signal is subsequently recomposed to produce a signal that is devoid of low frequency components.
DCT results are improved where the data is clustered beforehand. In particular, spectral clustering (SCA) reduces complex multidimensional data sets into clusters of similar data and is used to identify communities of nodes on a graph. The algorithm uses the connectivity approach to clustering to find patterns in the data and group the data into clusters. Points that are immediately next to each other or connected, are put into the same cluster. Data points on a graph are treated as nodes and clustering is treated as a graph partition. The nodes are mapped to a low dimensional space that can be segregated to form clusters. There are no assumptions made about the shape or form of the clusters, and the process involves three steps: (1) compute a similarity graph; (2) project the data onto a low dimensional space; and (3) create clusters.
Spectral clustering uses information from the eigenvalues of special matrices derived from the graph or data set. SCA treats the data as a graph partitioning problem without making any assumptions about the form of the data clusters. Unlike K-Means, which assumes that the points assigned to a cluster are spherical around the cluster center, which is a strong assumption that may not always be relevant, SCA creates more accurate clusters. It can correctly cluster observations that actually belong to the same cluster, but are farther off than observations in other clusters, due to dimension reductions.
The data must be projected into a low dimensional space. To do this, the graph Laplacian must be computed, which is a matrix representation of the graph and can be used to find properties of the graph. Computing the graph Laplacian gives eigenvalues and eigenvectors, which allows us to embed the data points into a low dimensional space.
Finally, clusters are created using the eigenvector corresponding to the second eigenvalue to assign values to each node. The lower the eigenvalue, the better the cluster. Nodes with a value less than zero are in one cluster and nodes with a value greater than zero are in another cluster.
In a preferred embodiment, a training or test data point is viewed as a unique extraction of 100 individuals, rather than a single individual, and each extraction's age array of 100 numbers may be represented as a set of DCT coefficients. A set of 100 ages may be accurately modeled using 50 coefficients or less. The effect is improved if the data is clustered beforehand, preferably by spectral clustering. The columns of data do not need to be sorted by age, but instead are placed in groups of, for example, 20, where each group has a typical age distribution (
Sorting by groups with a typical age distribution generates a more reliable model using fewer DCT coefficients. The DCT coefficient #25 for the CpG islands (353×DCT coefficient #25) may be used as training or test input and the corresponding DCT coefficient for age may be used as the age-like annotation, or the prediction outcome (see
As a result, it is possible to accurately predict approximately 30 DCT coefficients to recreate an age graph of the current set of 100 extracted individuals (see
In a preferred embodiment of the present invention, rows are placed in 20 groups, where each group has a typical age distribution (see
The principle of predicting, for example, DCT coefficient #3 for age, using an array of DCT #3 coefficients from each of 353 CpG islands (see
An approach combining DCT and spectral clustering when applying deep learning program on non-native data produces improved results (see
In some embodiments, the DNA sample is derived from human cells, tissues, blood, body fluids, urine, saliva, feces, or a combination thereof; preferably plasma or urine free DNA from humans. Free DNA in peripheral blood is mainly derived from hematopoietic cell lines; serum contains more DNA from hematopoietic cell lines than plasma free DNA, so the use of free DNA samples from plasma can reduce the background of the assay. The free DNA in the peripheral blood is excreted in the urine, so the urine also contains a large amount of free DNA. The urine test is a more non-invasive test than the peripheral blood test.
In some embodiments, processing the DNA sample is determining the degree of methylation of the CpG short tandem nucleic acid sequence in the DNA sample. In some embodiments, the processing of the DNA sample detects more than 100 CpG islands in the subject's genome. These CpG short tandem nucleic acid sequences have three characteristics: a) a large number of copies exist in the human genome, and tens of thousands of CpG short tandem sequences exist in the human genome, so that they can be used as target sites for genome-scale detection; b) CpG short tandem sequences are highly enriched in CpG islands, so they contain important DNA methylation information that can be used to determine abnormal state of the human body; and c) these CpG short tandem sequences have high melting temperatures and can therefore be effectively used for amplification.
In some embodiments, comparing the DNA sample to a known population compares the degree of methylation of the CpG short tandem nucleic acid sequence in the DNA sample with the degree of methylation in a DNA sample from a normal population.
In some embodiments, providing results to the subject may comprise providing results to a physician. In some embodiments, the physician determines a subsequent health assessment, including NMN dosing.
In another aspect, a method for detecting a condition associated with aging is described herein. In some embodiments the method comprises obtaining a DNA sample from a subject; detecting the methylation of at least one of the nucleic acid sequences, or a combination thereof; comparing the DNA sample to a known population; and providing treatment, including NMN formulations, to the subject based on the condition associated with aging.
In another aspect, a method for determining deep learning parameters to be used in detailing age prediction is provided. In some embodiments, the method comprises obtaining the following parameters: 1) size of layer; 2) number of layers; and 3) epochs of training. An important aspect of the invention, a parameter scanning approach is used where the aforementioned parameters are sampled at random and performance is optimized, usually to maximize prediction accuracy when using a particular CpG island list that represents e.g. a GO category. By parameter scanning, we can usually offset any performance loss that occurred when substituting CpG island lists.
In another aspect, the methods described herein can be applied to clinical research on the one hand, large-scale analyses to identify differentially methylated CpG islands associated with abnormal states of the human body; on the other hand, they can be applied to clinical molecular tests to predict individuals by differentially methylated CpG islands to identify whether there is an abnormal state.
Another aspect of the disclosure is a deep learning-tailored version of the CpG island list presented in Genome Biol. 2013; 14(10):R115 (Supplement 3), where at least 50% of the known CpG islands have been replaced. The published list of CpG islands is known to the person skilled in the art, and is represented in Genome Biol. 2013; 14(10):R115 (Supplement 3). A “modified CpG island list” or a “synonymous CpG island list”, as used herein, means that at least 50% of the CpG islands have been replaced. In some embodiments, the predictive performance (as measured by total number of errors per 100 test cases) is maintained in the modified version of the CpG island list. In some embodiments, the modified CpG island list achieves prediction performance better than 300 errors per 100 cases, a Pearson correlation of 0.98, or a median absolute difference between predicted and chronological age of less than 3.6 years in 50% of the subjects on native data.
Another aspect of the disclosure is a CpG island list, which encodes a biological system-specific age predictive property, according to the disclosure. In some embodiments, the list comprises system-specific CpG island sets consisting of CPG_LIST ID NO:1-5. In some embodiments, the list is biased towards a particular GO (Gene Ontology) term, displaying an FDR (False Discovery Rate) of 7.17E-07 or better. In some embodiments, an FDR rate of 6.12E-18. In some embodiments, an FDR of 4.33E-27 or better. As a non-limiting example, the modified CpG island list can be created by a “synonymy” strategy (see Methods).
Another aspect of the disclosure is the use of a noise approach to generate probability density functions of age prediction, to enable system-specific comparison of a senescence process. “To enable system-specific comparison”, as used herein, means that the deep learning prediction is repeated multiple times with stochastic noise added in parallel. In some embodiments, noise taken from the distribution of a CpG island's probability of being methylated is used, when compared across systems.
Deep learning, as used herein, can be any workflow involving a Tensorflow frontend, including, but not limited to the Keras package in Python. Preferably, predictions are made using one or more of the system-specific CpG island lists, see CPG_LIST ID NO:1-5. Even more preferably, the deep learning workflow parameters are selected from the custom and biosystem-specific 2-D parameter space ranges presented in PARAMETER_LIST ID NO:1-5.
In one embodiment, an exemplary workflow of learning from data and performing diagnosis starts from obtaining a raw data set and ingesting the raw data set to a computer system. The raw data set may comprise bioinformatic data and biological properties of a plurality of samples. Each sample may come from a specific tissue of an individual. The bioinformatic data of a sample may comprise methylation probability of CpG islands as a function of chromosomal location of the CpG islands. The biological properties of a sample may comprise the tissue type of the sample, disease state of the sample, and chronological age of the individual.
After inputting the raw data set to the computer system, the workflow then inputs the raw data set into a feature selection module programmed in the computer system. The feature selection module is configured to process the raw data set and output a training data set.
After obtaining the training data set, the workflow then inputs the training data set to a machine learning module programmed in the computer system. In some embodiments, the machine learning module comprises an artificial neural network, e.g., a deep learning model provided by the Keras library. The machine learning module uses the training data set to inform a hypothesis which minimizes the in-sample error, and subsequently outputs a learned model.
After obtaining the learned model, the workflow then inputs the learned model to a validation module programmed in the computer system. In addition, the workflow inputs a testing data set to the validation module. The testing data set has the same kind of data as the raw data set, but from a different set of samples. The validation module evaluates the testing data set on the learned model, and obtains a performance, i.e., out-of-sample error.
If the performance is not good, i.e., the out-of-sample error is not smaller than a desired threshold, the workflow can re-program the feature selection module to select alternative features, modify the hypothesis set used in the machine learning module (e.g., modify the artificial neural network structure), or both.
If the performance is good, i.e., the out-of-sample error is smaller than a desired threshold, the workflow then outputs the learned model as the final model. When presented with a new data from a patient, the workflow inputs the new data and the final model into a diagnosis module programmed in the computer system. The diagnosis module can use the new data and the final model to generate a diagnosis output for the patient.
A biological sample suitable for use in the methods disclosed herein can include any tissue or material derived from a living or dead subject. A biological sample can be a cell-free sample. A biological sample can comprise a nucleic acid (e.g., DNA, e.g., genomic DNA or mitochondrial DNA, or RNA) or a fragment thereof. The nucleic acid in the sample can be a cell-free nucleic acid. A sample can be a liquid sample or a solid sample (e.g., a cell or tissue sample). The biological sample can be a bodily fluid. In various embodiments, the majority of DNA in a biological sample that has been enriched for cell-free DNA (e.g., a plasma sample obtained via a centrifugation protocol) can be cell-free (e.g., greater than 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the DNA can be cell-free). The biological sample can be treated to physically disrupt tissue or cell structure (e.g., centrifugation and/or cell lysis), thus releasing intracellular components into a solution which can further contain enzymes, buffers, salts, detergents, and the like which are used to prepare the sample for analysis.
Methods, compositions, and systems provided herein can be used to analyze nucleic acid molecules in a biological sample. The nucleic acid molecules can be cellular nucleic acid molecules, cell-free nucleic acid molecules, or both. The cell-free nucleic acids used by methods as provided herein can be nucleic acid molecules outside of cells in a biological sample. The cell-free nucleic acid molecules can be present in various bodily fluids. Cell-free DNA molecules can be generated owing to cell death in various tissues that can be caused by health conditions and/or diseases, e.g., tumor invasion or growth, immunological rejection after organ transplantation.
Cell-free nucleic acid molecules, e.g., cell-free DNA, used in methods as provided herein can exist in plasma, urine, saliva, or serum. Cell-free DNA can occur naturally in the form of short fragments. Cell-free DNA fragmentation can refer to the process whereby high molecular weight DNA (such as DNA in the nucleus of a cell) are cleaved, broken, or digested to short fragments when cell-free DNA molecules are generated or released. Methods, compositions, and systems provided herein can be used to analyze cellular nucleic acid molecules in some cases, for instance, cellular DNA from a tumor tissue, or cellular DNA from white blood cells when the patient has leukemia, lymphoma, or myeloma. Sample taken from a tumor tissue can be subject to assays and analyses according to some examples of the present disclosure.
Tissue-specific markers can identify a tissue of origin for a DNA molecule. In some cases, the tissue-specific marker is a polynucleotide sequence of the genome of an organism. In some cases, the tissue-specific marker comprises a differentiated methylated region (DMR) which is identified based on the methylation status of one or more differentiated methylation sites contained within the marker polynucleotide sequence. In some cases, the one or more differentiated methylation sites comprise one or more CpG sites. In some cases, the one or more differentiated methylation sites comprise one or more non-CpG sites. In some cases, a tissue-specific marker as discussed herein can be referred to as a target sequence.
In some cases, the differentiated methylation sites of the tissue-specific marker have a first methylation status in a first tissue of the organism, whereas a second methylation status in a different second tissue of the organism. The first and second methylation statuses can be different so that the first and second tissues can be differentiated based on the methylation status of the tissue-specific marker. In some cases, the differentiated methylation sites of the tissue-specific marker have a first methylation status in a first tissue of the organism, whereas a second methylation status in all other tissues of the organism. The first and second methylation statuses can be different so that the first tissue can be differentiated from all other tissues of the organism based on the methylation status of the tissue-specific marker.
In some cases, the tissue-specific marker comprises at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 12, at least 15, at least 20, at least 25, at least 30, at least 50 differentiated methylation sites. In some cases, the tissue-specific marker comprises at least 5 differentiated methylation sites. A methylated nucleotide or a methylated nucleotide base can refer to the presence of a methyl moiety on a nucleotide base, where the methyl moiety is not present in a recognized typical nucleotide base. For example, cytosine does not contain a methyl moiety on its pyrimidine ring, but 5-methylcytosine contains a methyl moiety at position 5 of its pyrimidine ring. In this case, cytosine is not a methylated nucleotide and 5-methylcytosine is a methylated nucleotide. In another example, thymine contains a methyl moiety at position 5 of its pyrimidine ring, however, for purposes herein, thymine is not considered a methylated nucleotide when present in DNA since thymine is a typical nucleotide base of DNA. Typical nucleoside bases for DNA are thymine, adenine, cytosine and guanine. Typical bases for RNA are uracil, adenine, cytosine and guanine. Correspondingly a “methylation site” can be the location in the target gene nucleic acid region where methylation has, or has the possibility of occurring. For example a location containing CpG is a methylation site wherein the cytosine may or may not be methylated. A “site” can correspond to a single site, which can be a single base position or a group of correlated base positions, e.g., a CpG site. A methylation site can refer to a CpG site, or a non-CpG site of a DNA molecule that has the potential to be methylated. A CpG site can be a region of a DNA molecule where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along its 5′ to 3′ direction and that is susceptible to methylation either by natural occurring events in vivo or by an event instituted to chemically methylate the nucleotide in vitro. A non-CpG site can be a region that does not have a CpG dinucleotide sequence but is also susceptible to methylation either by naturally occurring events in vivo or by an event instituted to chemically methylate the nucleotide in vitro. A locus or region can correspond to a region that includes multiple sites.
The methylation status of the tissue-specific maker can comprise methylation density for individual sites within the marker region, a distribution of methylated/unmethylated sites over a contiguous region within the marker, a pattern or level of methylation for each individual methylation site within the marker that contains more than one sites, and non-CpG methylation. In some cases, the methylation status of the tissue-specific maker comprises methylation level (or methylation density) for individual differentiated methylation sites. The methylation density can refer to, for a given methylation site, a fraction of nucleic acid molecules methylated at the given methylation site over the total number of nucleic acid molecules of interest that contain such methylation site. For instance, the methylation density of a first methylation site in liver tissue can refer to a fraction of liver DNA molecules methylated at the first site over the total liver DNA molecules. In some cases, the methylation status comprises coherence of methylation/unmethylation status among individual differentiated methylation sites.
In some cases, the tissue-specific marker comprises methylation sites that are hypermethylated in a first tissue, but are hypomethylated in a second tissue. For instance, the tissue-specific marker can comprise one or more methylation sites that are hypermethylated in liver tissue, by which it can mean that one or more methylation sites have an at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, or at least 95% methylation density in liver tissue; in contrast, the one or more methylation sites can have an at most 50%, at most 40%, at most 30%, at most 20%, at most 10%, at most 5%, or 0% methylation density in other tissues, such as, but not limited to, blood cells, lung, esophagus, stomach, small intestines, colon, pancreas, urinary bladder, heart, and brain. The tissue-specific marker can comprise at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 12, at least 15, at least 20, at least 25, at least 30, at least 50 methylation sites that are hypermethylated in a first tissue, but hypomethylated in a second tissue. In some cases, the tissue-specific marker comprises at least 5 methylation sites that are hypermethylated in a first tissue, but hypomethylated in a second tissue. The tissue-specific marker can comprise at most 300 base-pairs (bp), at most 250 bp, at most 225 bp, at most 200 bp, at most 190 bp, at most 185 bp, at most 180 bp, at most 175 bp, at most 170 bp, at most 169 bp, at most 168 bp, at most 167 bp, at most 166 bp, at most 165 bp, at most 164 bp, at most 163 bp, at most 162 bp, at most 161 bp, at most 160 bp, at most 150 bp, at most 140 bp, at most 120 bp, or at most 100 bp. In some cases, the tissue-specific marker comprises at most 166 bp.
In some cases, the tissue-specific marker comprises methylation sites that are hypomethylated in a first tissue, but are hypermethylated in a second tissue. In some cases, the tissue-specific marker comprises methylation sites that are hypomethylated in a first tissue, but are hypermethylated in other tissues. For instance, the tissue-specific marker can comprise one or more methylation sites that are hypomethylated in liver tissue, by which it can mean that one or more methylation sites have an at most 50%, at most 40%, at most 30%, at most 20%, at most 10%, at most 5%, or 0% methylation density in liver tissue; in contrast, the one or more methylation sites can have an at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, or at least 95% methylation density in other tissues, such as, but not limited to, blood cells, lung, esophagus, stomach, small intestines, colon, pancreas, urinary bladder, heart, and brain. The tissue-specific marker can comprise at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 12, at least 15, at least 20, at least 25, at least 30, at least 50 methylation sites that are hypomethylated in a first tissue, but hypermethylated in a second tissue. In some cases, the tissue-specific marker comprises at least 5 methylation sites that are hypomethylated in a first tissue, but hypermethylated in a second tissue.
As used herein, the term “identity” or “percent identity” between two or more nucleotide or amino acid sequences can be determined by aligning the sequences for optimal comparison purposes (e.g., gaps can be introduced in the sequence of a first sequence). The nucleotides at corresponding positions can then be compared, and the percent identity between the two sequences can be a function of the number of identical positions shared by the sequences (i.e., % identity=# of identical positions/total # of positions×100). For example, a position in the first sequence may be occupied by the same nucleotide as the corresponding position in the second sequence, then the molecules are identical at that position. The percent identity between the two sequences may be a function of the number of identical positions shared by the sequences, taking into account the number of gaps, and the length of each gap, which need to be introduced for optimal alignment of the two sequences. In some cases, the length of a sequence aligned for comparison purposes is at least about: 30%, 40%, 50%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 95%, of the length of the reference sequence. A BLAST® search can determine homology between two sequences. The homology can be between the entire lengths of two sequences or between fractions of the entire lengths of two sequences. The two sequences can be genes, nucleotides sequences, protein sequences, peptide sequences, amino acid sequences, or fragments thereof. The actual comparison of the two sequences can be accomplished by well-known methods, for example, using a mathematical algorithm. A non-limiting example of such a mathematical algorithm can be those described in Karlin, S. and Altschul, S., Proc. Natl. Acad. Sci. USA, 90-5873-5877 (1993). Such an algorithm can be incorporated into the NBLAST and XBLAST programs (version 2.0), as described in Altschul, S. et al., Nucleic Acids Res., 25:3389-3402 (1997). When utilizing BLAST and Gapped BLAST programs, any relevant parameters of the respective programs (e.g., NBLAST) can be used. For example, parameters for sequence comparison can be set at score=100, word length=12, or can be varied (e.g., W=5 or W=20). Other examples include the algorithm of Myers and Miller, CABIOS (1989), ADVANCE, ADAM, BLAT, and FASTA.
In some embodiments, methods of identifying tissue-specific markers can comprise comparing methylation status across the genome among different tissue samples. Publicly available databases, such as, databases from RoadMap Epigenomics Project (Roadmap Epigenomics Consortium et al. Nature 2015; 518:317-30) and BLUEPRINT project (Martens et al. Haematologica 2013; 98:1487-9), can be utilized for bioinformatics analysis in order to screen for potential tissue-specific markers. In some cases, experimental validation is desirable. For instance, methylation-aware sequencing, such as bisulfite sequencing, can be performed to validate the methylation status among different tissues. In some cases, methylation-specific amplification can also be used for a relatively more target-orientated validation.
As used herein, the term “differentially methylated region” or “DMR” refers to a region in chromosomal DNA that is differentially methylated (e.g., at a CpG motif) between said species of DNA and the other DNA with which it is admixed in the sample. For example in one embodiment, the DMRs used in the present invention are differentially methylated between fetal and maternal DNA, or are differentially methylated between tumor-derived and non-tumor-derived DNA from the same individual. In particular embodiments of the present invention, the DMRs are hypermethlyated in fetal DNA and hypo methylated in maternal DNA, or are hypermethylated in tumor-derived DNA and hypomethylated in DNA that is derived from non-tumor tissue of the individual. That is, in such regions exhibit a greater degree (i.e., more) methylation in said species of DNA (e.g., the fetal or tumor cfDNA) as compared to the other DNA (e.g., maternal or non-tumor DNA), such as about 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or 100% or, or more of, the sites available for methylation at a given DMR are methylated in said species of DNA as compared to the same sites in the other DNA.
Methylation of DNA at CpG dinucleotides is one of the most important epigenetic modifications in mammalian cells. Short regions of DNA in which the frequency of 5′-CG-3′ (CpG) dinucleotides are higher than in other regions of the genome are called CpG islands (Bird, 1986). CpG islands often harbor the promoters of genes and play a pivotal role in the control of gene expression. In normal tissue, CpG islands are usually unmethylated but a subset of islands becomes methylated during tumor development (Jones and Baylin, 2002; Costello et al., 2000; Esteller et al., 2001). Methyl-CpG binding domain (MBD) proteins specifically recognize methylated DNA sequences and are essential components of regulatory complexes that mediate transcriptional repression of methylated DNA (Hendrich and Bird, 2000; Wade, 2001). One of the best-characterized members of the MBD protein family is MBD2. MBD2 has two isoforms, MBD2a and MBD2b, which are alternatively translated from the same mRNA (Hendrich and Bird, 1998). Recent studies indicate that interacting proteins can modulate the methylated DNA-binding ability of the MBD2 protein (Jiang et al., 2004). MBD3L1 interacts with MBD2b in vivo and in vitro and promotes the formation of larger methylated-DNA binding complexes (Jiang et al., 2004).
In some embodiments, epigenetic modifications (cytosine methylation) on multiple methylation sites of single nucleic acid molecules are analyzed by determining the methylation status of the methylation sites covered by the methylation haplotype. As used herein, “methylation status” refers to methylation or unmethylation of a cytosine residue, for example, in a CpG dinucleotide, or in other contexts, e.g., CHG, CHH, etc. Methylated cytosine may be a variety of forms, for example, 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC) and 5-carboxycytosine (5caC), etc. A variety of protocols are available for the analysis of methylation status, with or without target enrichment (reviewed in Plongthongkum et al. 2014). For example, reduced representation bisulphite sequencing (RRBS), methylation restriction enzyme sequencing (MRE-seq), methylation DNA immunoprecipitation sequencing (MeDIP-seq, Papageorgiou et al. 2010), methyl-CpG-binding domain protein sequencing (MBD-seq), methylation DNA capture sequencing (MethylCap-seq), etc. In some embodiments, methylation status can be obtained by the standard short-read sequencing platforms, such as Illumina's HiSeq/MiSeq or LifeTech's Ion Proton, as part of the methylation assay without extra efforts and cost. In some embodiments, the methylation status may be analyzed using a target methylation sequencing technology (e.g., Bisulfite Padlock Probes, or BSPP, Deng et al, 2008; Diep et al. 2012). In some embodiments, the target nucleic acids may be enriched before the methylation status analysis. For example, micro-droplet PCR (Komori et al. 2011) or Selector probes (Johansson et al. 2011) can be used with some differences in the requirement of input materials and/or cost. Accordingly, each of the foregoing approaches may be utilized in some embodiments of the methods and compositions described herein.
In some embodiments, the determination of the methylation pattern, methylation frequency, or methylation status of the one or more candidate genes/loci may be accomplished by means of one or more methods selected from reverse-phase HPLC, thin-layer chromatography, SssI methyltransferases with incorporation of labeled methyl groups, the chloracetaldehyde reaction, differentially sensitive restriction enzymes, hydrazine or permanganate treatment (m5C is cleaved by permanganate treatment but not by hydrazine treatment), bisulfate sequencing, combined bisulfate-restriction analysis, pyrosequencing, methylation-sensitive single-strand conformation analysis (MS-SSCA), high resolution melting analysis (HRM), methylation-sensitive single nucleotide primer extension (MS-SnuPE), base-specific cleavage/MALDI-TOF, methylation-specific PCR (MSP), microarray-based methods, and MspI cleavage (reviewed, e.g., in Rein, T. et al. (1998) Nucl. Acids Res. 26: 2255-2264). Methods for detecting methylation status have been described in, for example U.S. Pat. Nos. 6,214,556; 5,786,146; 6,017,704; 6,265,171; 6,200,756; 6,251,594; 5,912,147; 6,331,393; 6,605,432; 5,786,146; 6,143,504; 6,596,493; 6,884,586; 6,300,071; and 7,195,870; and U.S. Patent Application Publication Nos. 20030148327; 20030148326; 20030143606; 20050009059; and 20060292564; each of which are incorporated herein in their entirety by reference. Alternative array-based methods of methylation analysis are disclosed in U.S. Patent Application Publication No. 20050196792. See also Oakeley, E. J., (1999) Pharmacol. Ther. 84: 389-400; Fraga et al., (2002) BioTechniques 33: 632-649; and Dahl et al., (2003) Biogerontology 4: 233-250.
In some embodiments, methods for identifying methylation may be based on differential cleavage by restriction enzymes that are used. Methylation-sensitive restriction analysis followed by PCR amplification or Southern analysis have been disclosed, for example, in Huang, T. H. et al. (1997) Cancer Res. 57: 1030-1034; Zuccotti, M. et al, (1993) Meth. Enzymol. 225: 557-567; Carrel, L. et al. (1996) Am. J Med. Genet. 64: 27-30; and Chang et al. (1992) Plant Mol. Biol. Rep. 10: 362-366. In some embodiments, enzymes that include at least one CpG dinucleotide in the recognition site may be used. Enzymes with a recognition site that includes the sequence CCGG include, for example, MspI, HpaII, AgeI, XmaI, SmaI, NgoMIV, Noel, and BspEI. Enzymes with a recognition site that includes the sequence CGCG include, for example, BstUI (CGCG, MSRE), MluI (ACGCGT, MSRE), SacII (CCGCGG, MSRE), BssHII (GCGCGC, MSRE) and NruI (TCGCGA, MSRE). NotI, BstZI, CspI and EagI have two CpGs in their recognition sites and cleavage is blocked by CpG methylation. Enzymes with a recognition site that includes the sequence GCGC include, for example, HinPlI, HhaI, AfeI, KasI, NarI, SfoI, BbeI, and FspI. Enzymes with a recognition site that includes the sequence TCGA include, for example, TaqI, ClaI (MSRE), BspDI (MSRE), PaeR7I, TliI, XhoI, SalI, and BstBI. For additional enzymes that contain CpG in the recognition sequence and for information about the enzyme's sensitivity to methylation, see, for example, the New England Biolabs catalog and web site. In some aspects two restriction enzymes may have a different recognition sequence but generate identical overhangs or compatible cohesive ends. For example, the overhangs generated by cleavage with HpaII or MspI can be ligated to the overhang generated by cleavage with TaqI. Some restriction enzymes that include CpG in the recognition site are unable to cleave if the site is methylated; these are methylation sensitive restriction enzymes (MSRE). Other enzymes that contain CpG in their recognition site can cleave regardless of the presence of methylation; these are methylation insensitive restriction enzymes (MIRE). A third type of enzyme cleaves only when the recognition site is methylated, and are referred to herein as methylation dependent restriction enzymes (MDRE). Examples of MIREs that have a CpG in the recognition sequence include, for example, BsaWI (WCCGGW), BsoBI, BssSI, MspI, and TaqI. Examples of MSREs, that include a CpG in the recognition site, include AatII, AciI, AclI, AfeI, AgeI, AscI, AvaI, BmgBI, BsaAI, BsaHI, BspDI, ClaI, EagI, FseI, PauI, HaeIII, HpaII, HinPII, MluI, NarI, NotI, NruI, PvuI, SacII, SalI, SmaI and SnaBI. In preferred aspects a pair of enzymes that have differential sensitivity to methylation and cleave at the same recognition sequence with one member of the pair being a MSRE and the other member being MIRE is used. Still other enzymes include BthCI, GlaI, HpaI, HinPlI, DpnI, MboI, ChaI and BstKTI.
In some embodiments, use of digital PCR technique can allow the direct determination of the actual number of the target DNA molecules without the need of calibrators. Other technologies, such as certain sequencing-based methods, such as, but not limited to, bisulfite sequencing and non-bisulfite-based methylation-aware sequencing using the PacBio sequencing platform, can determine the relative or fractional concentration of the DNA from the target tissues in relation to other tissues. The absolute amount can refer to an absolute count of DNA molecules, or in some cases, can also refer to a concentration of DNA molecules, e.g., number, mole, or weight per volume, e.g., copies/mL, mole/L, or mg/L. The analysis of the absolute amount as provided herein can be useful in scenarios when increased amounts of DNA would be released from more than one type of tissues. Methylation deconvolution analysis, based on sequencing of cell-free nucleic acid molecules, such as disclosed in U.S. patent application Ser. No. 14/803,692, on the other hand, can provide readout of tissue of origin of cell-free nucleic acids in the form of fractional contribution, e.g., a first tissue contributes A % of cell-free nucleic acids from a biological sample, and a second tissue contributes B % of cell-free nucleic acids from the same biological sample.
In some embodiments, technologies like real-time PCR, sequencing and microarray for methylation analysis of cell-free nucleic acids may be used. In some cases, the absolute number of cell-free nucleic acids harboring a tissue-specific marker, such as counting positive reactions in a digital PCR assay, may not be derived directly from methylation analysis by some technologies. However, such absolute number can be calculated indirectly based on concentrations (relative or fractional) of cell-free nucleic acids harboring tissue-specific markers, for instance, by taking the total number or concentration of cell-free nucleic acids in a given volume of biological sample into account. In some cases, the sequencing that can be used in the methods provided herein can include chain termination sequencing, hybridization sequencing, Illumina sequencing (e.g., using reversible terminator dyes), ion torrent semiconductor sequencing, mass spectrophotometry sequencing, massively parallel signature sequencing (MPSS), Maxam-Gilbert sequencing, nanopore sequencing, polony sequencing, pyrosequencing, shotgun sequencing, single molecule real time (SMRT) sequencing, SOLiD sequencing (hybridization using four fluorescently labeled di-base probes), universal sequencing, or any combination thereof. Microarrays having probes targeting methylation sites can also be used for analyzing methylation status of the cell-free DNA molecules in the methods provided herein.
In some embodiments, bisulfite sequencing is used for generating methylation data at single-base resolution. The term “bisulfite conversion” refers to a biochemical process for converting unmethylated cytosine residue to uracil or thymine residues, whereby methylated cytosine residues are preserved. “Bisulfite conversion” may be carried out computationally from a nucleic acid sequence contained in a computer file (such as those in FASTA, FASTQ or any file format known in the art), wherein all cytosine residues in a sequence of interest are changed to thymine or uracil residues. Exemplary reagents for bisulfite conversion include sodium bisulfite and magnesium bisulfite. “Bisulfite reagent” refers to a reagent comprising bisulfite, disulfite, hydrogen sulfite or combinations thereof, useful as disclosed herein to distinguish between methylated and unmethylated CpG dinucleotide sequences. One way to obtain such methylation data is to sequence the entire epigenome directly. Due to the difficulty in mapping bisulfite converted sequence reads and the methylation heterogeneity in a cell population, approximately 100 gigabases (Gb) of sequence data would be needed to generate a high-resolution human DNA methylation map (Lister, R. et al., (2009) Nature, 462(7271): 315-322). Other methylation profiling approaches include array capture (Hodges, E. et al., (2009) Genome Res. 19(9): 1593-1605), padlock probe capture (Deng, J. et al., (2009) Nat. Biotech. 27: 353-360; Ball, M. P. et al., (2009) Nat, Biotech., 27(4): 361-368) and reduced representation bisulfite sequencing (Gu et al., (2010) Nat. Methods 7(2): 133-136).
In particular, bisulfite sequencing involves conversion of unmethylated cytosine to uracil or thymine through a three-step process during sodium bisulfite modification. The steps are sulfonation to convert cytosine to cytosine sulfonate, deamination to convert cytosine sulfonate to uracil sulfonate or thymine sulfonate and alkali desulfonation to convert uracil sulfonate to uracil or thymine sulfonate to thymine. Conversion of methylated cytosine is much slower and is not observed at significant levels in a 4-16 hour reaction (Clark, S. J. et al, (1994) Nucleic Acids Res., 22(15): 2990-7). If the cytosine is methylated it will remain a cytosine. If the cytosine is unmethylated, it will be converted to uracil or thymine. When the modified strand is copied, for example, extension of a locus specific primer, a random or degenerate primer or a primer to an adaptor, a G will be incorporated in the interrogation position (opposite the C being interrogated) if the C was methylated and an A will be incorporated in the interrogation position if the C was unmethylated. When the double stranded extension product is amplified, those Cs that were converted to Us or Ts and resulted in incorporation of A in the extended primer will be replaced by Ts during amplification. Those Cs that were not modified and resulted in the incorporation of G will remain as C. Bisulfite treatment can degrade the DNA making it difficult to amplify. The sequence degeneracy resulting from the treatment also complicates primer design. The treatment may also result in incomplete desulfonation, depurination and other as yet uncharacterized DNA damage, making downstream processing more challenging. The treatment can also result in preferential amplification of unmethylated DNA relative to methylated DNA. This may be mitigated by increasing the PCR extension time.
Kits for DNA bisulfite modification are commercially available from, for example, Human Genetic Signatures' Methyleasy and Chemicon's CpGenome Modification Kit. See also, WO04096825, which describes bisulfite modification methods and Olek, A. et al. (1994) Nucl. Acids Res. 24(24): 5064-6, which discloses methods of performing bisulfite treatment and subsequent amplification on material embedded in agarose beads. In some aspects a catalyst such as diethylenetriamine may be used in conjunction with bisulfite treatment, see Komiyama, M. and Oshima, S., (1994) Tetrahedron Lett. 35(44): 8185-8188. Diethylenetriamine has been shown to catalyze bisulfite ion-induced deamination of 2′-deoxycytidine to 2′-deoxyuridine at pH 5 efficiently. Other catalysts include ammonia, ethylene-diamine, 3,3′-diaminodipropylamine, and spermine. In some aspects, deamination is performed using sodium bisulfite solutions of 3-5 M with an incubation period of 12-16 hours at about 50° C. A faster procedure has also been reported using 9-10 M bisulfite pH 5.4 for about 10 minutes at 90° C., see Hayatsu, H. et al., (2004) Proc. Jpn. Acad. Ser. B 80(4): 189-194.
Bisulfite treatment allows the methylation status of cytosines to be detected by a variety of methods. For example, any method that may be used to detect a single nucleotide polymorphism (SNP) may be used, for examples, see Syvanen, A. C. (2001) Nature Rev. Gen. 2(12): 930-942. In a preferred aspect, bisulfite sequencing methods, systems, and computer program products described herein may provide information regarding not only methylation frequencies or methylation status of a sequence of interest at single base resolution, but also information regarding SNPs, preferably in the same sequencing run. Other methods such as single base extension (SBE) may be used or hybridization of sequence specific probes similar to allele specific hybridization methods. “Variants” or “alleles” generally refer to one of a plurality of species each encoding a similar sequence composition, but with a degree of distinction from each other. The distinction may include any type of variation known to those of ordinary, skill in the related art, that include, but are not limited to, polymorphisms such as SNPs, insertions or deletions (the combination of insertion/deletion events are also referred to as “indels”), differences in the number of repeated sequences (also referred to as tandem repeats), and structural variations. Detection of such variants or alleles is also within the ambit of the subject matter described herein.
In some embodiments, the disclosed methods use commercial sequencing by synthesis platforms, such as the Genome Sequencer from Roche/454 Life Sciences, the Genome Analyzer from Illumina/Solexa, the SOLiD system from Applied BioSystems, Pacific Biosystems and the Heliscope system from Helicos Biosciences. Exemplary sequencing platforms may have one or more of the following features: 1) four differently optically labeled nucleotides are utilized (e.g., Genome Analyzer); 2) sequencing-by-ligation is utilized (e.g., SOLiD); 3) pyrosequencing is utilized (e.g., Roche/454); and 4) four identically optically labeled nucleotides are utilized (e.g., Helicos).
Such sequencing reactions and assays include sequencing by ligation methods commercialized by Applied Biosystems (e.g., SOLiD sequencing). In general, double stranded fragment nucleic acid molecules can be prepared by the methods described herein, and then incorporated into a water-in-oil emulsion along with polystyrene beads and amplified, for example by PCR. In some cases, alternative amplification methods can be employed in the water-in-oil emulsion such as any of the methods provided herein. The amplified product in each water microdroplet formed by the emulsion can interact, bind, or hybridize with the one or more beads present in that microdroplet leading to beads with a plurality of amplified products of substantially one sequence. When the emulsion is broken, the beads float to the top of the sample and are placed onto an array. The methods can include a step of rendering the nucleic acid bound to the beads single-stranded or partially single stranded. Sequencing primers are then added along with a mixture of four different fluorescently labeled oligonucleotide probes. The probes bind specifically to the two bases in the nucleic acid molecule to be sequenced immediately adjacent and 3′ of the sequencing primer to determine which of the four bases are at those positions. After washing and reading the fluorescence signal from the first incorporated probe, a ligase is added. The ligase cleaves the oligonucleotide probe between the fifth and sixth bases, removing the fluorescent dye from the nucleic acid molecule to be sequenced. The whole process is repeated using a different sequence primer until all of the intervening positions in the sequence are imaged. The process allows the simultaneous reading of millions of DNA fragments in a “massively parallel” manner. This “sequence-by-ligation” technique uses probes that encode for two bases rather than just one, allowing error recognition by signal mismatching and leading to increased base determination accuracy.
Alternative sequencing methods include sequencing by synthesis methods commercialized by 454/Roche Life Sciences including but not limited to the methods and apparatus described in Margulies et al., Nature (2005) 437:376-380 (2005); and U.S. Pat. Nos. 7,244,559; 7,335,762; 7,211,390; 7,244,567; 7,264,929; and 7,323,305. In general, double stranded fragment nucleic acid molecules can be prepared by the methods described herein, immobilized onto beads, and compartmentalized in a water-in-oil PCR emulsion. In some cases, alternative amplification methods can be employed in the water-in-oil emulsion such as any of the methods provided herein. When the emulsion is broken, amplified fragments remain bound to the beads. The methods can include a step of rendering the nucleic acid bound to the beads single stranded or partially single stranded. The beads can be enriched and loaded into wells of a fiber optic slide so that there is approximately 1 bead in each well. Nucleotides are flowed across and into the wells in a fixed order in the presence of polymerase, sulfhydrolase, and luciferase. Addition of nucleotides complementary to the target strand can result in a chemiluminescent signal that is recorded, such as by a camera. The combination of signal intensity and positional information generated across the plate allows software to determine the DNA sequence.
Other sequencing methods include those commercialized by Hclicos BioSciences Corporation (Cambridge, Mass.) as described in U.S. patent application Ser. No. 11/167,046, and U.S. Pat. Nos. 7,501,245; 7,491,498; 7,276,720; and in U.S. Patent Application Publication Nos. 20090061439; 20080087826; 20060286566; 20060024711; 20060024678; 20080213770; and 20080103058. In general, double stranded fragment nucleic acid molecules can be isolated and purified, then immobilized onto a flow-cell surface. The methods can include a step of rendering the nucleic acid bound to the flow-cell surface stranded or partially single stranded. Polymerase and labeled nucleotides are then flowed over the immobilized DNA. After fluorescently labeled nucleotides are incorporated into the DNA strands by a DNA polymerase, the surface is illuminated with a laser, and an image is captured and processed to record single molecule incorporation events to produce sequence data.
Other methods include sequencing by ligation methods commercialized by Dover Systems. Generally, oligonucleotides, primers, and probes can be prepared by the methods described herein. The nucleic acid molecules can then be amplified in an emulsion in the presence of magnetic beads. Any amplification methods can be employed in the water-in-oil emulsion. The resulting beads with immobilized clonal nucleic acid polonies are then purified by magnetic separation, capped, amine functionalized, and covalently immobilized in a series of flow cells. The methods can include a step of rendering the nucleic acid bound to the flow-cell surface stranded or partially single stranded. A series of anchor primers are flowed through the cell, where they hybridize to the synthetic oligonucleotide sequences at the 3′ or 5′ end of proximal or distal genomic DNA tags. Once an anchor primer is hybridized, a mixture of fully degenerate nonanucleotides (“nonamers”) and T4 DNA ligase is flowed into the cell. Each of the nonamer mixture's four components is labeled with one of four fluorophores, which correspond to the base type at the query position. The fluorophore-tagged nonamers selectively ligate onto the anchor primer, providing a fluorescent signal that identifies the corresponding base on the genomic DNA tag. Once the probes are ligated, fluorescently labeling the beads, the array is imaged in four colors. Each bead on the array will fluoresce in only one of the four images, indicating whether there is an A, C, G, or T at the position being queried. After imaging, the array of annealed primer-fluorescent probe complex, as well as residual enzyme, are chemically striped using guanidine HCl and sodium hydroxide. After each cycle of base reads at a given position have been completed, and the primer-fluorescent probe complex has been stripped, the anchor primer is replaced, and a new mixture of fluorescently tagged nonamers is introduced, for which the query position is shifted one base further into the genomic DNA tag. Seven bases are queried in this fashion, with the sequence performed from the 5′ end of the proximal tag, followed by six base reads with a different anchor primer from the 3′ end of the proximal tag, for a total of 13 base pair reads for this tag. This sequence is then repeated for the 5′ and 3′ ends of the distal tag, resulting in another 13 base pair reads. The ultimate result is a read length of 26 bases (thirteen from each of the paired tags). However, it is understood that this method is not limited to 26 base read lengths.
Other methods for sequencing include those commercialized by Illumina as described U.S. Pat. Nos. 5,750,341; 6,306,597; and 5,969,119. In general, oligonucleotides, primers, and probes can be prepared by the methods described herein to produce amplified nucleic acid sequences tagged at one (e.g., (A)/(A′) or both ends (e.g., (A)/(A′) and (C)/(C′)). In some cases, single stranded nucleic acid tagged at one or both ends is amplified by the methods described herein (e.g., by SPIA or linear PCR). The resulting nucleic acid is then denatured and the single stranded amplified nucleic acid molecules are randomly attached to the inside surface of flow-cell channels. Unlabeled nucleotides are added to initiate solid-phase bridge amplification to produce dense clusters of double-stranded DNA. To initiate the first base sequencing cycle, four labeled reversible terminators, primers, and DNA polymerase are added. After laser excitation, fluorescence from each cluster on the flow cell is imaged. The identity of the first base for each cluster is then recorded. Cycles of sequencing are performed to determine the fragment sequence one base at a time. For paired-end sequencing, such as for example, when the nucleic acid molecules are labeled at both ends by the methods described herein, sequencing templates can be regenerated in-situ so that the opposite end of the fragment can also be sequenced.
Still other sequencing methods include those commercialized by Pacific Biosciences as described in U.S. Pat. Nos. 7,462,452; 7,476,504; 7,405,281; 7,170,050; 7,462,468; 7,476,503; 7,315,019; 7,302,146; 7,313,308; and U.S. Patent Application Publication Nos. US20090029385; US20090068655; US20090024331; and US20080206764. In general, oligonucleotides, primers and probes can be prepared by the methods described herein. Target nucleic acid molecules can then be immobilized in zero mode waveguide arrays. The methods may include a step of rendering the nucleic acid bound to the waveguide arrays single stranded or partially single stranded. Polymerase and labeled nucleotides are added in a reaction mixture, and nucleotide incorporations are visualized via fluorescent labels attached to the terminal phosphate groups of the nucleotides. The fluorescent labels are clipped off as part of the nucleotide incorporation. In some cases, circular templates are utilized to enable multiple reads on a single molecule.
Another example of a sequencing technique that can be used in the methods described herein is nanopore sequencing (see e.g. Soni, G. V. and Meller, A. (2007) Clin. Chem. 53: 1996-2001). A nanopore can be a small hole of the order of one nanometer in diameter. Immersion of a nanopore in a conducting fluid and application of a potential across it can result in a slight electrical current due to conduction of ions through the nanopore. The amount of current that flows is sensitive to the size of the nanopore. As a DNA molecule passes through a nanopore, each nucleotide on the DNA molecule obstructs the nanopore to a different degree. Thus, the change in the current passing through the nanopore as the DNA molecule passes through the nanopore can represent a reading of the DNA sequence.
Yet another example of a sequencing technique that can be used is semiconductor sequencing provided by Ion Torrent (e.g., using the Ion Personal Genome Machine (PGM)). Ion Torrent technology can use a semiconductor chip with multiple layers, e.g., a layer with micro-machined wells, an ion-sensitive layer, and an ion sensor layer. Nucleic acids can be introduced into the wells, e.g., a clonal population of single nucleic can be attached to a single bead, and the bead can be introduced into a well. To initiate sequencing of the nucleic acids on the beads, one type of deoxyribonucleotide (e.g., dATP, dCTP, dGTP, or dTTP) can be introduced into the wells. When one or more nucleotides are incorporated by DNA polymerase, protons (hydrogen ions) are released in the well, which can be detected by the ion sensor. The semiconductor chip can then be washed and the process can be repeated with a different deoxyribonucleotide. A plurality of nucleic acids can be sequenced in the wells of a semiconductor chip. The semiconductor chip can comprise chemical-sensitive field effect transistor (chemFET) arrays to sequence DNA (for example, as described in U.S. Patent Application Publication No. 20090026082). Incorporation of one or more triphosphates into a new nucleic acid strand at the 3′ end of the sequencing primer can be detected by a change in current by a chemFET. An array can have multiple chemFET sensors.
In some embodiments, the sequence reads may be aligned to a reference genome using known methods in the art to determine alignment position information. The alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read. Alignment position information may also include sequence read length, which can be determined from the beginning position and end position. A region in the reference genome may be associated with a gene or a segment of a gene.
From the sequence reads, the location and methylation state for each of CpG site may be determined based on alignment to a reference genome. Further, a methylation state vector for each fragment may be generated specifying a location of the fragment in the reference genome (e.g., as specified by the position of the first CpG site in each fragment, or another similar metric), a number of CpG sites in the fragment, and the methylation state of each CpG site in the fragment whether methylated (e.g., denoted as M), unmethylated (e.g., denoted as U), or indeterminate (e.g., denoted as I). The methylation state vectors may be stored in temporary or persistent computer memory for later use and processing. Further, duplicate reads or duplicate methylation state vectors from a single subject may be removed. In an additional embodiment, it may be determined that a certain fragment has one or more CpG sites that have an indeterminate methylation status. Such fragments may be excluded from later processing or selectively included where downstream data model accounts for such indeterminate methylation statuses.
In some embodiments, molecular inversion probes (MIP), described in Hardenbol, P. et al. Genome Res. 15:269-275 (2005) and in U.S. Pat. No. 6,858,412, may be used to determine methylation status after methylation dependent modification. A MIP may be designed for each cytosine to be interrogated. In a preferred aspect the MIP includes a locus specific region that hybridizes upstream and one that hybridizes downstream of an interrogation site and can be extended through the interrogation site, incorporating a base that is complementary to the interrogation position. The interrogation position may be the cytosine of interest after bisulfite modification and amplification of the region and the detection can be similar to detection of a polymorphism. Separate reactions may be performed for each NTP so extension only takes place in the reaction containing the base corresponding to the interrogation base or the different products may be differentially labeled.
Padlock Probe (PP) technology is a multiplex genomic enrichment method allowing for accurate targeted high-throughput sequencing. PP technology has been used to perform highly multiplexed genotyping, digital allele quantification, targeted bisulfate sequencing, and exome sequencing. Padlock probe technology may utilize a linear oligonucleotide molecule with two binding sequences at each end joined by a common linker sequence. The probe's binding arms may be hybridized several base pairs apart surrounding a target single-stranded genomic DNA region. A DNA polymerase (with each of the four standard dNTP molecules) may be used to fill in the gap between the two binding arms, and a DNA ligase may be used to circularize the resulting molecule. A mixture of exonucleases may then used to digest all arm; the resulting circular molecules can be amplified using rolling circle amplification or polymerase chain reaction to generate a DNA library compatible with modern high-throughput sequencers. Padlock probes are generally designed to have binding arms with complementary sequence around a chosen target region of approximately 100-200 base pairs; each binding arm is generally designed to have a specific DNA melting temperature.
The term “padlock probe” (PLP) refers to circularized nucleic acid molecules which may combine specific molecular recognition and universal amplification (or specific amplification and general recognition), thereby increasing sensitivity and multiplexing capabilities without limiting the range of potential target organisms. PLPs are long oligonucleotides of approximately 100 bases (but can be of any length), containing target complementary regions (referred to herein as “target-capturing sequences”) at both their 5′ and 3′ ends. These regions recognize adjacent sequences on the target nucleic acid sequence (Nilsson, M., et al. (1994) Science 265: 2085-2088) and may also contain “binding arms” which comprise “extension arms” having priming sites (e.g., universal priming sites”), sites recognized by ligase enzymes, and unique sequence identifiers, sometimes referred to as a “ZipCode” or “barcode”. Upon hybridization, the ends of the probes are situated into adjacent position, and can be joined by enzymatic ligation at the ligation sites (also referred to herein as “ligation arms”) converting the probe into a circular molecule (also known in the art and referred to herein as an “amplicon”) that is threaded on the target strand. This ligation and the resulting circular molecule can only take place when both ligation arm and extension arm segments recognize their target sequences correctly. Non-circularized probes may be removed by exonuclease treatment, while the circularized entities may be amplified with universal primers, which may or may not contain barcode or ZipCode sequences. This mechanism ensures reaction specificity, even in a complex nucleotide extract with a large number of padlock probes. Subsequently, the target-specific products are detected by a universal cZipCode microarray (Shoemaker, D. D., et al., (1996) Nat. Genet. 14: 450-456). PLPs have high specificity and multiplexing capabilities in genotyping assays (Hardenbol, P., et al., (2003) Nat. Biotechnol., 21: 673-678.).
In some embodiments, the disclosed methods characterize the DNA methylation profile of a sample using bisulfite sequencing and include mapping a genomic sequence of interest to determine methylation status, methylation frequency, and detection of single nucleotide polymorphisms. During bisulfite sequencing, all cytosines present in the DNA molecule except those that are methylated are converted to thymines. Almost every methylated cytosine is present as a cytosine-guanine dinucleotide (CpG), though not all CpGs are methylated. In such embodiments, after the genome/chromosome is loaded into memory, the software application or computer program product performs an in silico bisulfite conversion, computationally converting all cytosines except those present in a CpG to thymines. The application or program then designs probes as previously, but penalizes the inclusion of CpG sites in each binding arm. During linker sequence insertion, the software application or computer program product generates multiple probes for those arm pairs containing a CpG; one probe assumes a methylated state (and contains a CpG dinucleotide) while the other assumes an unmethylated state (and contains a TpG instead). This procedure allows for efficient targeted bisulfite sequencing of hundreds of thousands of CpG sites in parallel.
In some embodiments, the disclosed methods use the probes or primers to obtain sequence reads of a target genome or sequence of interest by bisulfite sequencing and loading it into memory. A software application or computer program product encodes the sequence reads by predicting the forward and reverse orientation of each of the sequence reads to generate at least one forward sequence read and at least one reverse sequence read. The forward and reverse sequence reads are then converted by the software application or computer program product by computationally changing all cytosine residues in the forward sequence reads to thymine residues in silico, and changing all guanine residues to adenine residues in the reverse sequence reads. The bisulfite-converted genome sequence and all forward and reverse sequence reads are then aligned computationally by an alignment software application or computer program (e.g., ELAND, SOAP2Align, Bowtie, BWA, BLAST or any other alignment program known in the art). The alignment application or program can be a stand-alone application or integrated into the system, software application or computer program product described herein. The aligned sequences are then combined to create a map of the target genomic sequence. The software application or computer program product then analyzes and computes methylation frequencies or methylation status of the mapped sequences in entirety. In preferred embodiments, the mapped sequences may also be analyzed by the software application or computer program product for the presence of single nucleotide polymorphisms. Because bisulfite sequencing provides sequence read information at single-base resolution, this technique (and modifications thereof described in the methods, systems, and computer program products described herein) is particularly advantageous for calculating methylation frequencies and detecting SNPs in a single sequencing reaction.
In one example, to specifically capture an arbitrary subset of genomic targets for single-molecule bisulfite sequencing and digital quantification of DNA methylation at single-nucleotide resolution, the disclosed method may include: A set of 30,000 padlock probes designed to assess the methylation state of ˜66,000 CpG sites within 2,020 CpG islands on human chromosome 12, chromosome 20, and 34 [selected regions]. To investigate epigenetic differences associated with de-differentiation, methylation in three human fibroblast lines and eight human pluripotent stem cell lines was compared. Chromosome-wide methylation patterns were similar among all lines studied, but cytosine methylation was slightly more prevalent in the pluripotent cells than in the fibroblasts. Induced pluripotent stem (iPS) cells appeared to display more methylation than embryonic stem cells. In fibroblasts and pluripotent cells, 288 regions were methylated differently. This targeted approach is particularly useful for analyzing DNA methylation in large genomes. Padlock probes have been previously used for exon capture and resequencing (Porreca, G. J. et al. (2007) Nat. Methods 4: 931-936). This approach to targeted bisulfite sequencing involves the in situ synthesis of long (˜150 nt) oligonucleotides on programmable microarrays, followed by their cleavage and enzymatic conversion into padlock probes. A library of padlock probes was annealed to the template DNA, circularized, and amplified by PCR before shotgun sequencing. There are, however, two major challenges in performing padlock capture for bisulfite sequencing. First, bisulfite treatment converts all unmethylated cytosines into uracils, resulting in marked reduction of sequence complexity. Achieving specific target capture on bisulfite-converted DNA is more difficult than on native genomic DNA. Second, low capturing sensitivity, high bias and random losses of alleles was initially observed with the “eMIP method” previously disclosed in the art (Porreca, G. J. et al. (2007) Nat. Methods 4: 931-936). Obtaining accurate and efficient quantification of DNA methylation was not possible with the existing protocol, especially with the presence of allelic drop-outs.
An exemplary workflow of targeted bisulfite sequencing using padlock probes includes the steps of: padlock probe design, padlock probe production, multiplex capture on bisulfate-converted DNA, capture circles amplification, shotgun sequencing library construction, followed by read mapping and data analysis.
Any of the methods disclosed herein can be performed and/or controlled by one or more computer systems. In some examples, any step of the methods disclosed herein can be wholly, individually, or sequentially performed and/or controlled by one or more computer systems. Any of the computer systems mentioned herein can utilize any suitable number of subsystems. In some embodiments, a computer system includes a single computer apparatus, where the subsystems can be the components of the computer apparatus. In other embodiments, a computer system can include multiple computer apparatuses, each being a subsystem, with internal components. A computer system can include desktop and laptop computers, tablets, mobile phones and other mobile devices.
The subsystems can be interconnected via a system bus. Additional subsystems include a printer, keyboard, storage device(s), and monitor that is coupled to display adapter. Peripherals and input/output (I/O) devices, which couple to I/O controller, can be connected to the computer system by any number of connections known in the art such as an input/output (I/O) port (e.g., USB, FireWire®). For example, an I/O port or external interface (e.g., Ethernet, Wi-Fi, etc.) can be used to connect computer system to a wide area network such as the Internet, a mouse input device, or a scanner. The interconnection via system bus allows the central processor to communicate with each subsystem and to control the execution of a plurality of instructions from system memory or the storage device(s) (e.g., a fixed disk, such as a hard drive, or optical disk), as well as the exchange of information between subsystems. The system memory and/or the storage device(s) can embody a computer readable medium. Another subsystem is a data collection device, such as a camera, microphone, accelerometer, and the like. Any of the data mentioned herein can be output from one component to another component and can be output to the user.
A computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface or by an internal interface. In some embodiments, computer systems, subsystem, or apparatuses can communicate over a network. In such instances, one computer can be considered a client and another computer a server, where each can be part of a same computer system. A client and a server can each include multiple systems, subsystems, or components.
The present disclosure provides computer control systems that are programmed to implement methods of the disclosure. For example, a computer system that is programmed or otherwise configured to determine an absolute amount of cell-free nucleic acid molecules from a tissue of an organism as described herein. The computer system can implement and/or regulate various aspects of the methods provided in the present disclosure, such as, for example, controlling sequencing of the nucleic acid molecules from a biological sample, performing various steps of the bioinformatics analyses of sequencing data as described herein, integrating data collection, analysis and result reporting, and data management. The computer system can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.
The computer system includes a central processing unit (CPU, also “processor” and “computer processor” herein), which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system also includes memory or memory location (e.g., random-access memory, read-only memory, flash memory), electronic storage unit (e.g., hard disk), communication interface (e.g., network adapter) for communicating with one or more other systems, and peripheral devices, such as cache, other memory, data storage and/or electronic display adapters. The memory, storage unit, interface and peripheral devices are in communication with the CPU through a communication bus, such as a motherboard. The storage unit can be a data storage unit (or data repository) for storing data. The computer system can be operatively coupled to a computer network (“network”) with the aid of the communication interface. The network can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network in some cases is a telecommunication and/or data network. The network can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network, in some cases with the aid of the computer system, can implement a peer-to-peer network, which can enable devices coupled to the computer system to behave as a client or a server.
The CPU can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions can be stored in a memory location, such as the memory. The instructions can be directed to the CPU, which can subsequently program or otherwise configure the CPU to implement methods of the present disclosure. Examples of operations performed by the CPU can include fetch, decode, execute, and writeback. The CPU can be part of a circuit, such as an integrated circuit. One or more other components of the system can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).
The storage unit can store files, such as drivers, libraries and saved programs. The storage unit can store user data, e.g., user preferences and user programs. The computer system in some cases can include one or more additional data storage units that are external to the computer system, such as located on a remote server that is in communication with the computer system through an intranet or the Internet.
The computer system can communicate with one or more remote computer systems through the network. For instance, the computer system can communicate with a remote computer system of a user (e.g., a Smart phone installed with application that receives and displays results of sample analysis sent from the computer system). Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system via the network.
Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system, such as, for example, on the memory or electronic storage unit. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor. In some cases, the code can be retrieved from the storage unit and stored on the memory for ready access by the processor. In some situations, the electronic storage unit can be precluded, and machine-executable instructions are stored on memory.
The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
Aspects of the systems and methods provided herein, such as the computer system, can be embodied in programming. Various aspects of the technology can be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which can provide non-transitory storage at any time for the software programming. All or portions of the software can at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, can enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that can bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also can be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
For example a as computer-executable code, may be used. Such media can be of many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as can be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media can take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer can read programming code and/or data. Many of these forms of computer readable media can be involved in carrying one or more sequences of one or more instructions to a processor for execution.
The computer system can include or be in communication with an electronic display that comprises a user interface (UI) for providing, for example, results of sample analysis, such as, but not limited to graphic showings of relative and/or absolute amounts of cell-free nucleic acids from different tissues, control or reference amount of cell-free nucleic acids from certain tissues, comparison between detected and reference amounts, and readout of presence or absence of cancer metastasis. Examples of UI's include, without limitation, a graphical user interface (GUI) and web-based user interface.
Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit. The algorithm can, for example, control sequencing of the nucleic acid molecules from a sample, direct collection of sequencing data, analyzing the sequencing data, or determining a classification of pathology based on the analyses of the sequencing data.
In some cases, a sample can be obtained from a subject, such as a human subject. A sample can be subjected to one or more methods as described herein, such as performing an assay. In some cases, an assay can comprise hybridization, amplification, sequencing, labeling, epigenetically modifying a base, or any combination thereof. One or more results from a method can be input into a processor. One or more input parameters such as sample identification, subject identification, sample type, a reference, or other information can be input into a processor. One or more metrics from an assay can be input into a processor such that the processor can produce a result, such as a classification of pathology (e.g., diagnosis) or a recommendation for a treatment. A processor can send a result, an input parameter, a metric, a reference, or any combination thereof to a display, such as a visual display or graphical user interface. A processor can (i) send a result, an input parameter, a metric, or any combination thereof to a server, (ii) receive a result, an input parameter, a metric, or any combination thereof from a server, (iii) or a combination thereof. In some embodiments, an integrated diagnosis system may comprise the assay, processor, and display in a compact device and perform the diagnosis method automatically.
Aspects of the present disclosure can be implemented in the form of control logic using hardware (e.g., an application specific integrated circuit or field programmable gate array) and/or using computer software with a generally programmable processor in a modular or integrated manner. As used herein, a processor includes a single-core processor, multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or circuit network. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement embodiments described herein using hardware and a combination of hardware and software.
Any of the software components or functions described in this application can be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C, C++, C#, Objective-C, Swift, or scripting language such as Perl or Python using, for example, conventional or object-oriented techniques. The software code can be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission. A suitable non-transitory computer readable medium can include random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard-drive or a floppy disk, or an optical medium such as a compact disk (CD) or DVD (digital versatile disk), flash memory, and the like. The computer readable medium can be any combination of such storage or transmission devices.
Such programs can also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet. As such, a computer readable medium can be created using a data signal encoded with such programs. Computer readable media encoded with the program code can be packaged with a compatible device or provided separately from other devices (e.g., via Internet download). Any such computer readable medium can reside on or within a single computer product (e.g., a hard drive, a CD, or an entire computer system), and can be present on or within different computer products within a system or network. A computer system can include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
Any of the methods described herein can be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps. Thus, embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, with different components performing a respective steps or a respective group of steps. Although presented as numbered steps, steps of methods herein can be performed at a same time or in a different order. Additionally, portions of these steps can be used with portions of other steps from other methods. Also, all or portions of a step can be optional. Additionally, any of the steps of any of the methods can be performed with modules, units, circuits, or other approaches for performing these steps.
A workflow of systems and devices for sequencing nucleic acid samples are also disclosed herein. In one example, the workflow includes a device, such as a sequencer and an analytics system. The sequencer and the analytics system may work in tandem to perform one or more steps in the processes described herein.
In various embodiments, the sequencer receives an enriched nucleic acid sample. The sequencer can include a graphical user interface that enables user interactions with particular tasks (e.g., initiate sequencing or terminate sequencing) as well as one more loading stations for loading a sequencing cartridge including the enriched fragment samples and/or for loading necessary buffers for performing the sequencing assays. Therefore, once a user of the sequencer has provided the necessary reagents and sequencing cartridge to the loading station of the sequencer, the user can initiate sequencing by interacting with the graphical user interface of the sequencer. Once initiated, the sequencer performs the sequencing and outputs the sequence reads of the enriched fragments from the nucleic acid sample.
In some embodiments, the sequencer is communicatively coupled with the analytics system. The analytics system includes some number of computing devices used for processing the sequence reads for various applications such as assessing methylation status at one or more CpG sites, variant calling or quality control. The sequencer may provide the sequence reads in a BAM file format to the analytics system. The analytics system can be communicatively coupled to the sequencer through a wireless, wired, or a combination of wireless and wired communication technologies. Generally, the analytics system is configured with a processor and non-transitory computer-readable storage medium storing computer instructions that, when executed by the processor, cause the processor to process the sequence reads or to perform one or more steps of any of the methods or processes disclosed herein.
In some embodiments, the sequence reads may be aligned to a reference genome using known methods in the art to determine alignment position information. Alignment position may generally describe a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide based and an end nucleotide base of a given sequence read. Corresponding to methylation sequencing, the alignment position information may be generalized to indicate a first CpG site and a last CpG site included in the sequence read according to the alignment to the reference genome. The alignment position information may further indicate methylation statuses and locations of all CpG sites in a given sequence read. A region in the reference genome may be associated with a gene or a segment of a gene; as such, the analytics system may label a sequence read with one or more genes that align to the sequence read. In one embodiment, fragment length (or size) is determined from the beginning and end positions.
In various embodiments, for example when a paired-end sequencing process is used, a sequence read is comprised of a read pair denoted as R_1 and R_2. For example, the first read R_1 may be sequenced from a first end of a double-stranded DNA (dsDNA) molecule whereas the second read R_2 may be sequenced from the second end of the double-stranded DNA (dsDNA). Therefore, nucleotide base pairs of the first read R_1 and second read R_2 may be aligned consistently (e.g., in opposite orientations) with nucleotide bases of the reference genome. Alignment position information derived from the read pair R_1 and R_2 may include a beginning position in the reference genome that corresponds to an end of a first read (e.g., R_1) and an end position in the reference genome that corresponds to an end of a second read (e.g., R_2). In other words, the beginning position and end position in the reference genome represent the likely location within the reference genome to which the nucleic acid fragment corresponds. In one embodiment, the read pair R_1 and R_2 can be assembled into a fragment, and the fragment used for subsequent analysis and/or classification. An output file having SAM (sequence alignment map) format or BAM (binary) format may be generated and outputted for further analysis.
Referring now to an analytics system for processing DNA samples according to one embodiment. The analytics system implements one or more computing devices for use in analyzing DNA samples. The analytics system includes a sequence processor, sequence database, model database, models, parameter database, and score engine. In some embodiments, the analytics system performs one or more steps in the processes and other process described herein.
The sequence processor generates methylation state vectors for fragments from a sample. At each CpG site on a fragment, the sequence processor generates a methylation state vector for each fragment specifying a location of the fragment in the reference genome, a number of CpG sites in the fragment, and the methylation state of each CpG site in the fragment whether methylated, unmethylated, or indeterminate via the process. The sequence processor may store methylation state vectors for fragments in the sequence database. Data in the sequence database may be organized such that the methylation state vectors from a sample are associated to one another.
Further, multiple different models may be stored in the model database or retrieved for use with test samples. In one example, a model is a trained cancer classifier for determining a cancer prediction for a test sample using a feature vector derived from anomalous fragments. The training and use of the cancer classifier is discussed elsewhere herein. The analytics system may train the one or more models and store various trained parameters in the parameter database. The analytics system stores the models along with functions in the model database.
During inference, the score engine uses the one or more models to return outputs. The score engine accesses the models in the model database along with trained parameters from the parameter database. According to each model, the score engine receives an appropriate input for the model and calculates an output based on the received input, the parameters, and a function of each model relating the input and the output. In some use cases, the score engine further calculates metrics correlating to a confidence in the calculated outputs from the model. In other use cases, the score engine calculates other intermediary values for use in the model.
In this prophetic example, an epigenetic interface analyzes a subject's genome from a number of non-intrusive sources including saliva or blood samples. Neural networks will then process the subject's genome and discover characteristics which correlate with particular phenotypes and physiological parameters. These findings are then compared to non-native data using SCA and DCT to reveal the subject genome's personal performance. A physician or health-care individual provides the subject with a personalized health assessment. The physician or health-care professional further prescribes or advises a medical treatment course, including NMN administration, to address the individual's personalized health assessment.
Using a “replacement by synonymity” approach, surprisingly, it was identified that a set of 353 CpG islands where >50% of the CpG islands had been replaced by novel, previously unreported CpG islands could retain the same performance as the unsubstituted list. Introduction of the novel CpG islands in the in-house data set significantly lowered the prevalence of known CpG islands without compromising prediction performance. Even more surprisingly, some neural network designs even favored the substituted data set, in particular, if the deep learning parameters had been optimized by parameter scanning, indicating a specific role for these CpG islands in optimizing age prediction.
In one such scenario of CpG island substitution, transmembrane signaling receptor activity was investigated. The fold enrichment is presented in Table 1. The table shows, in the third and fourth columns, how a particular CpG island list with 353 members have a higher number of mappings than expected to different GO terms related to signaling.
Steve Horvath presented a list of 353 CpG islands in a paper, Genome Biol, 2013; 14(10):R115. Horvath described predictive performance on native data using regression model as a set error of 3.6 years, indicating that DNAm age differs by less than 3.6 years in 50% of subjects. Horvath referred to this measure as (median) ‘error’—that is the median absolute difference between DNAm age and chronological age. It has been identified that relying on the total sum or error in a test set of 100 individuals, frequently trying to keep this measure under 300, would be desirable. Depending on the tissue, Horvath reported Pearson correlation coefficients ranging from 0.89 to 0.97. Horvath's CpG island list can be found in Supplementary File 3 of Horvath's paper, and starts with: cg00075967, cg00374717, cg00864867. The following three data sets were used in this example: GSE27097 (leukocytes from autism sibs, children from Atlanta, Ga.), GSE41037 (whole blood, adults from Los Angeles, Calif.), and GSE42861 (whole blood, adults from Shanghai, China), because they are large, and complement each other in their coverage of different age groups. The first step in the data processing pipeline is using a Perl script that extracts lines from the series matrix files that contain the relevant CpG islands only. Using the “int-run” programmatic interface 0.1, a standard benchmark was performed using the 353 published CpG islands vs. something called the “pro3 set”, an in-house version of this set, where 50% of the islands have been replaced by de-novo islands. A Python script trains a Keras model using 4×95-sized dense layers for 73 epochs of training, the default settling of the main parameters, but which may be modified by parameter scanning for some GO substituted CpG island lists.
An exemplary method of feature selection from data, implemented by the feature selection is disclosed herein. As described above, the raw data set, comprising raw data of a plurality of samples, is inputted into the feature selection module. The feature selection module then performs the following step: for each “Horvath learned feature” (i.e., one of the 353 CpG islands discovered by Horvath which, in combination, have a good predicting power of an individual's age), the computer system automatically searches and attempts to find a “synonymous feature” from the rest of the CpG islands. The “synonymous feature” may be required to be highly correlated with the Horvath learned feature in the plurality of samples and is located on a different gene. If a synonymous feature is found, the computer system stores the synonymous feature (CpG island).
After finding as many synonymous features as possible, the feature selection module then performs the following step: for each sample, the computer system generates a modified feature list by replacing the Horvath learned feature with the found synonymous feature. The feature selection module then outputs a training data set, which comprises the modified feature lists of the plurality of samples.
Using a higher order GO term (e.g., GO:0008152, metabolic process) as a filter, and BioMart (https://m.ensembl.org/biomart/martview/0a90d25a087cc4dd7b24dab256d80e93), to select gene names from Ensembl Genes, Human genes (GRCh38.p13), retrieving only the gene name as feature, and filtering for redundant results, resulted in a pre-filtering gene count of 11.742. Next, a Perl script to grep cg IDs was used to identify gene names from GPL13534-11288.txt, a mapping between CpG islands and locations etc. The cg IDs were used to grep full lines from GSE27097_series_matrix.txt. A second Perl script was used to replace each cg ID in the Horvath 353 CpG list by a cg ID that maps to the current GO term, calculating Pearson correlation and heuristically selecting the best synonymous CpG island as a substitution. A third Perl script was used to filter the replacement CpG list, retaining the original CpG island if the replacement island had a Pearson correlation lower than e.g. 0.75 compared to the original CpG island, and also excluding any replacement CpG islands that do not map to all of GSE27097, 41037, or 42861, while maintaining a replacement frequency of greater than 50%, preferably higher. A fourth Perl script was used to decode the final cg ID list as common gene names, which can be confirmed for enrichment as measured by FDR on the Gene Ontology. Following this workflow, stable CpG island lists were created that perform better than 300 errors per 100 test cases, and displaying FDRs for enriched GO terms in the e-20, e-30, or e-40 range.
Using stochastic noise, a Perl script was used to introduce noise at a certain user defined level (0.01-0.2), or in increments of $noise_level=($m*0.025), where $m ranges from 0-10. The noise is introduced to the CpG methylation probability array for a selected individual using the formula ($line[$ind]+rand($noise_level)−($noise_level/2)). Each noised replicate is used as input to the machine learning process, and the results are plotted as a probability distribution in R studio.
Using a parameter scanning approach in Python, three random numbers were generated num1=random.randint(3, 7), num2=random.randint(50, 200), num3=random.randint(50, 100). These numbers correspond to the number of dense layers to be added to the Keras model, the size of each layer, and the number of epochs of training to be performed. A sample run of five epochs is performed to determine if the current setting favors CPU or GPU usage. Commonly, the product of num1 and num2 is also recorded as a measure of the number of trainable parameters.
Input may be reformatted as a square shape with some empty cells (20×20 dim). A random sample of 100 from the concatenated file was used. This could be different each time, and could certainly arbitrarily affect performance of an evaluation run. Temporary test set selections used a time stamp in seconds from a point back in time (used as temporary file name) and labeled. The test set selection can be in “srand mode”, where the selection is the same each time, in case of wanting to compare the same selection between e.g. different CpG island lists or different optimized parameters. The Keras model contains three invariable initial layers, a central portion with a variable number of N×M-sized dense layers, and two invariable final layers. The first invariable initial layer consists of a Conv2D layer, sized 100, with a 3×3 kernel size and a 20×20 input shape. The second invariable initial layer consists of a MaxPooling 2D layer with a 2×2 pool size. The third invariable initial layer (or layer operation) is a flattening operation, flattening the 2-d layer for fully connected layers. These initial layers are followed by N×M-sized Dense layers, using tf.nn.rely as the activation function. Finally, before the compilation of the model, there is a Dropout layer (using parameter 0.2). This is followed by the final Dense, 100-sized layer, using the tf.nn.softmax (Softmax) activation function. Compilation is achieved using a sparse categorical cross entropy, using accuracy metrics. A choice between whether to use CPU or GPU for the current parameters (N layers, M-sized layers, E # of epochs) is estimated from timing a 5-epochs long non-verbose test run using CPU and GPU. In some embodiments of the Python interface, the trained model can be saved and re-loaded, depending on the application. In an important aspect of the invention, by using parameter scanning, we can optimize the current parameters (N layers, M-sized layers, E # of epochs) such that they give maximum prediction performance for a GO category substituted CpG island list, offsetting any performance loss incurred in the island substitution process.
The fact that many public methylation data sets are annotated with age, gender, and sometimes ethnicity, make age predictions particularly easy to verify, even though different data sets may use different tissues, such as whole blood or saliva. Many labs have developed certain age prediction methods, including the “GrimAge” remaining lifespan index of the Horvath lab at UCLA (PMID: 30669119). These methods have also been commercialized (PMID: 32791973), sometimes sold as an independent service or sold in conjunction with anti-aging supplements, such as nicotinamide mononucleotide (NMN), and similar compounds.
It was shown that predictors which predict binary health outcomes in cross tissue comparisons could be successfully applied on non-native data, using the criteria of diversity and robustness rather than “accuracy”. This example is provided to illustrate how a points-based HealthIndex could be incrementally built up, in the wider general context of age and healthspan predictions, and how DCT would enable it.
In one embodiment, disclosed methods may include:
Prediction on non-native data: While the prediction performance on native data was good, the challenge comes when applying the model on non-native data, which could include data from a different tissue or data generation platform. To develop a points-based HealthIndex, where the true answer may not be known, there were two things to look for: “diverse and repeatable” predictions. For testing on non-native data, a different data set, GSE40279, was frequently used which comes from whole blood and contains 470.043 CpG islands from 656 middle aged subjects (PMID: 23177740). By summing the positive supervised predictions from 10 or 100 retrainings, an average prediction outcome was determined and visualized using e.g. “ggplot”. Retraining the model and counting the number of positive predictions from 10 vs. 100 consecutive retrainings reveal that the predictions are “robust” (i.e. they converged well before 100 retrainings), even though a “correct” answer is not necessarily known.
HealthIndex and DCT are highly related because the HealthIndex almost always uses a cross data set comparison. If merely age was predicted, it might be possible to find a data set that was similar to the test data, but for the HealthIndex typically the same test data and training data from many different studies are used, which almost certainly necessitates the use of the DCT in non-native data prediction.
Here, it has been shown that a points-based HealthIndex could be incrementally built up by adding up positive and negative health points taken from predictions made on non-native data. The development of a broader HealthIndex is based on the growing number of public data sets. In particular, the bottleneck problem of how non-native predictions could be processed was addressed by studying how predictions could be made on data that could come from a different tissue, or data generation platform.
This table shows the base performance correlation in cross data set comparisons (GSE41037, GSE90124, GSE51032, GSE36194, GSE50660, GSE42861, GSE39279, GSE53740, GSE64511, GSE40279; numbered ST0-9 in text). The columns refer to the testing set used; the rows refer to the training set used.
The table above show a cross data set test using traditional deep learning vs DCT method, testing on GSE41037 and training on GSE90124. The last two columns are the Aage from these methods respectively. For this example, the benefit of using the DCT method is that r increased from 0.38 to 0.74 while total error decreased from approximately 1600 years to approximately 800 years over 100 test cases.
The embodiments disclosed herein are illustrative of certain underlying principles. Modifications to the disclosed embodiments including alternative embodiments may be employed, and are within the skill of persons of ordinary skill in the art, and within the scope of the claims. The teachings and the claims are not limited to embodiments precisely as shown and described.
This application claims priority to U.S. Utility patent application Ser. No. 17/179,976 filed on Feb. 19, 2021, which claims priority to U.S. Provisional Application No. 63/110,174, filed Nov. 5, 2020, the contents of which are incorporated by reference in their entirety.
Number | Date | Country | |
---|---|---|---|
63110174 | Nov 2020 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 17179976 | Feb 2021 | US |
Child | 17361827 | US |