The present invention is in the field of cell free DNA methylome analysis.
Small fragments of DNA circulate freely in the peripheral blood of healthy and diseased individuals. These cell-free DNA (cfDNA) molecules are thought to derive from dying cells and are thus a reflection of ongoing cell death taking place in the body. In recent years, this understanding has led to the emergence of diagnostic tools which are impacting multiple areas of medicine. Specifically, next generation sequencing of fetal DNA circulating in maternal blood has allowed non-invasive prenatal testing of fetal chromosomal abnormalities; detection of donor-derived DNA in the circulation of organ transplant recipients can be used for early identification of graft rejection; and the evaluation of mutated DNA in the circulation can be used to detect, genotype and monitor cancer. These technologies are powerful at identifying genetic anomalies in circulating DNA yet are not informative when cfDNA does not carry mutations. A key limitation is that sequencing does not reveal the tissue origins of cfDNA, precluding the identification of tissue-specific cell death. The latter is critical in many settings such as neurodegenerative, inflammatory or ischemic diseases, not involving DNA mutations. Even in oncology, it is often important to determine the tissue origins of the tumor in addition to determining its mutational profile, for example in cancers of unknown primary (CUP) and in the setting of early cancer diagnosis. Finally, identification of collateral tissue damage (e.g. toxicity of drugs in genetically normal tissues) is a key element in drug development and monitoring of response to therapy.
Several approaches have been proposed recently for tracing the tissue sources of cfDNA, all based on tissue-specific epigenetic signatures. Nucleosome positioning in different tissues has been used to infer the origins of cfDNA, based on the idea that nucleosome-free regions are more likely to be degraded upon cell death and hence will be under-represented in cfDNA. An alternative approach is based on DNA methylation patterns. Methylation of cytosine adjacent to guanine (CpG sites) is an essential component of cell type-specific gene regulation, and hence is a fundamental mark of cell identity. It was recently shown that the detection of cfDNA molecules with a methylation pattern unique to a certain tissue can be used to identify cell death derived from that tissue. Others have taken a genome-wide approach to the problem and used the plasma methylome to assess the origins of cfDNA. It has been shown that the deconvolution of cfDNA methylation profiles can be used to infer the relative contributions of four different tissues, using low-depth whole genome bisulfite sequencing (WGBS). It was further demonstrated that cfDNA methylation has the potential to detect cancer DNA in circulation, as well as to identify its tissue of origin in two cancer types, using a reduced representation bisulfite sequencing (RRBS) approach. Nevertheless, a method of reliably determining the tissue of origin of all cfDNA found in a subject, and not just the most abundant, is greatly needed. Such a method would allow for early diagnosis of conditions that may be unknown to the physician or patient, and not just a directed assessment of an already predicted condition.
The present invention provides methods of determining the origin of cell free DNA (cfDNA) and of detecting death of a cell type or tissue in a subject by determining the origin of cfDNA in the subject. Computer program products for doing same and methods of constructing a methylome atlas are also provided.
According to a first aspect, there is provided a method of determining the cell type or tissue of origin of cell free DNA (cfDNA) comprising:
According to another aspect, there is provided a method of detecting death of a cell type or tissue in a subject comprising:
According to some embodiments, at least 50 ng of cfDNA are provided. According to some embodiments, the providing comprises providing a bodily fluid and isolating the cfDNA from the bodily fluid.
According to some embodiments, the measuring DNA methylation comprises bisulfite conversion of the cfDNA. According to some embodiments, the measuring further comprises performing a methylome array or chip on the bisulfite converted cfDNA.
According to some embodiments, the methylome atlas comprises only data from purified cell types. According to some embodiments, the atlas comprises only data from non-blood derived purified cell types. According to some embodiments, the methylome atlas comprises methylation data from at least 5 of the following 34 tissues or cell types: monocytes, B-cells, CD4+ T-cells, NK-cells, CD8+ T-cells, eosinophils, neutrophils, erythrocyte progenitors, adipocytes, neurons, hepatocytes, lung alveolar cells, pancreatic beta cells, pancreatic acinar cells, pancreatic duct cells, vascular endothelial cells, left atrium, bladder, breast, cervix, colon, esophagus, oral cavity, kidney, prostate, rectum, stomach, thyroid, uterus, lung bronchial cells, cholangiocytes, muscle, oligodendrocytes, and ovary. According to some embodiments, the methylome atlas comprises data from all of the 34 tissues or cell types.
According to some embodiments, the methylome atlas comprises at least the 100 most uniquely methylated or unmethylated sites in each tissue or cell type. According to some embodiments, the methylome atlas further comprises any CpG sites within at least 150 base pairs upstream and downstream of the most uniquely methylated and most uniquely unmethylated sites in each tissue or cell type. According to some embodiments, the CpG sites within at least 150 base pairs upstream and downstream are selected from Tables 1 and 2. According to some embodiments, the methylome atlas further comprises at least one of the 500 CpG sites that best differentiate between the most similar pairs of tissues and cell types. According to some embodiments, the 500 CpG sites that best differentiate between the most similar pairs of tissues and cell types are selected from Table 3. According to some embodiments, the most uniquely methylated sites are selected from Table 1. According to some embodiments, the most uniquely hypomethylated sites are selected from Table 2.
According to some embodiments, cfDNA of the tissue or cell type comprises as little as 1% of all of the cfDNA.
According to some embodiments, the methods of the invention are for use in detecting a disease state in a subject in need thereof and wherein the cfDNA is from the subject. According to some embodiments, the disease state is selected from organ transplantation, sepsis, and cancer. According to some embodiments, the disease is cancer, and the method determines the cell or tissue of origin of the cancer.
According to another aspect, there is provided a computer program product for determining the cell or tissue of origin of cell free DNA (cfDNA), comprising a non-transitory computer-readable storage medium having program code embodied thereon, the program code executable by at least one hardware processor to
According to another aspect, there is provided a method of constructing a methylome atlas, comprising:
According to some embodiments, the methods of the invention further comprise:
According to some embodiments, the methods of the invention further comprise:
According to another aspect, there is provided a computerized method of determining the cell type or tissue of origin of cell free DNA (cfDNA) comprising:
using at least one hardware processor to:
According to another aspect, there is provided a computer program product for determining the cell or tissue of origin of cell free DNA (cfDNA), comprising a non-transitory computer-readable storage medium having program code embodied thereon, the program code executable by at least one hardware processor to:
According to another aspect, there is provided a computerized system for determining the cell or tissue of origin of cell free DNA (cfDNA), comprising: at least one hardware processor; and a non-transitory computer-readable storage medium having program code embodied thereon, the program code executable by the at least one hardware processor to:
According to some embodiments, the first set comprises between 25 and 100 most uniquely methylated sites.
According to some embodiments, the first set comprises a plurality of most uniquely methylated sites and wherein the plurality of neighboring methylated sites comprises any CpG sites within between 150 and 500 base pairs upstream and downstream of said most uniquely methylated sites in each tissue or cell type.
According to some embodiments, the second set comprises between 100 and 500 most uniquely methylated sites.
According to some embodiments, the second set of uniquely methylated sites compares a plurality of specific pairs or triplets of tissue types, such as similar tissue types, or the like.
According to some embodiments, the first set of uniquely methylated sites are uniquely methylated as compared to all cell types and tissues of the atlas, and wherein the second set of uniquely methylated sites are uniquely methylated in one cell type or tissue as compared to a second most similar cell type or tissue.
According to some embodiments, the comparing comprises using a latent probabilistic model.
Further embodiments and the full scope of applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description together with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
The present invention provides methods of determining the origin of cell free DNA (cfDNA) and detecting death of a cell type or tissue in a subject by determining the origin of cfDNA in the subject. The methods of the invention are based on the surprising finding that, by generating an atlas of informative methylation sites for various tissues and cell types, cfDNA methylation sequencing can be deconvoluted to accurately identify the origin of cfDNA molecules even when they are a very small percentage of the total DNA sampled. The methods of the invention are further based on the surprising findings that use of purified cell types in place of whole tissues, and an atlas comprising the most uniquely methylated and unmethylated sites between the different tissues/cell types, provided superior deconvolution results.
By a first aspect, there is provided a method of determining the cell or tissue of origin of a cell free DNA (cfDNA) comprising:
In some embodiments, the cfDNA is from a subject and assigning a cfDNA molecule to a cell or tissue of origin indicates detection of death of that cell or tissue. In some embodiments, the subject is suspected of having increased cell death. In some embodiments, the subject is not suspected of having increased cell death. In some embodiments, the subject appears healthy and/or does not suffer from a disease or condition.
By another aspect, there is provided a method of detecting death of a cell type or tissue in a subject comprising:
As used herein, “cfDNA” refers to any DNA obtained from an organism which existed in the organism outside of a cell. In some embodiments, the cfDNA is DNA obtained from an organism which existed in the organism outside of any vesicle. Cell-free DNA is well known in the art, and generally refers to DNA that is free floating within a bodily fluid. This DNA is generally not enclosed in a vesicle and thus DNA in transport, such as by exosomes or other vesicular transporters, is not considered cfDNA. In some embodiments, cfDNA is DNA from a dying and/or dead cell. When a cell dies the DNA is generally fragmented and released from the cell as it lyses. This DNA, however, is not all immediately removed or cleaned up and thus persists in the organism. Frequently the DNA from the dead cell enters the bloodstream.
Since cfDNA has a short half-life in the organism, it provides a snapshot of the cell death occurring in the organism at that moment. In some embodiments, the methods of the invention detect cell death that has occurred within the last 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, or 60 minutes from the time of providing the cfDNA. Each possibility represents a separate embodiment of the invention.
In some embodiments, the cfDNA is mammalian cfDNA. In some embodiments, the cfDNA is human cfDNA. In some embodiments, the cfDNA is extracted from bodily fluid. In some embodiments, the providing comprises providing a bodily fluid and isolating the cfDNA from the bodily fluid. In some embodiments, the bodily fluid is blood. In some embodiments, the bodily fluid is selected from at least one of: blood, serum, gastric fluid, intestinal fluid, saliva, bile, tumor fluid, interstitial fluid, and stool. Standard techniques for cell-free DNA extraction are known to a skilled artisan, a non-limiting example of which is the QIAamp Circulating Nucleic Acid kit (QIAGEN).
In some embodiments, at least 10, 20, 30, 40, 50, 60, 70, 80, 90 or 100 ng of cfDNA are provided. Each possibility represents a separate embodiment of the invention. In some embodiments, as little as 10, 20, 30, 40, 50, 60, 70, 80, 90 or 100 ng of cfDNA are provided. Each possibility represents a separate embodiment of the invention. In some embodiments, at least 50 ng are provided. In some embodiments, as little as 50 ng are provided.
In some embodiments, the providing comprises providing a bodily fluid and isolating the cfDNA from the bodily fluid. In some embodiments, the bodily fluid is selected from: blood, serum, gastric fluid, intestinal fluid, saliva, bile, tumor fluid, interstitial fluid, breast milk and stool. In some embodiments, the bodily fluid is any bodily fluid that contains cfDNA. In some embodiments, the bodily fluid is blood. In some embodiments, the bodily fluid is any one of whole blood, partially lysed whole blood, plasma, or partially processed whole blood.
The sample of blood can be obtained by standard techniques, such as using a needle and syringe. In another embodiment, the blood sample is a peripheral blood sample. Alternatively, the blood sample can be a fractionated portion of peripheral blood, such as a plasma sample. In another embodiment, once the blood sample is obtained, total DNA can be extracted from the sample using standard techniques known to one skilled in the art. In some embodiments, intact cells are removed before DNA extraction, so that only free-floating DNA is extracted. Intact cells can be removed by any method known in the art, such as, for non-limiting example, by centrifugation or by gradient separation, such as by Ficoll gradient separation. A non-limiting example for DNA extraction is the FlexiGene DNA kit (QIAGEN). In another embodiment, maternal plasma may be further separated from peripheral blood by centrifugation, such as exemplified herein, at 1,900×g for 10 minutes at 4° C. The plasma supernatant may be re-centrifuged at 16,000×g for 10 minutes at 4° C. In another embodiment, a fraction of the resulting supernatant is used for cell-free DNA extraction, to thereby obtain plasma DNA extracts. Standard techniques for cell-free DNA extraction are known to a skilled artisan, a non-limiting example of which is the QIAamp Circulating Nucleic Acid kit (QIAGEN). In some embodiments, the total cfDNA is subsequently fragmented, such as to sizes of approximately 300 bp-800 bp. For example, the total DNA can be fragmented by sonication.
Measuring DNA methylation may be performed by any method known in the art. Non-limiting examples include deep sequencing following bisulfite conversion, ELISA-based methylation kits, methylation sensitive PCR, and the luminometric methylation assay (LUMA). In some embodiments, measuring DNA methylation comprises bisulfite conversion. In some embodiments, measuring DNA methylation further comprises next generation sequencing. In some embodiments, only the loci present in the atlas are sequenced. Next generation sequencing, also known as high-throughput sequencing, is any sequencing method that allows for rapid high-throughput sequencing of base pairs from DNA or RNA samples. Such sequencing is well known in the art and can include Illumina arrays and ion torrent as non-limiting examples. Next generation sequencing of DNA methylation works on a similar principle and may be performed with arrays such as the Illumina EPIC array and the Illumina 450K array, for example. In some embodiments, data from the whole genome is used. In some embodiments, data from chips or arrays are used. Such chip/array data may decrease background, lower costs, and provide more reliable, cleaner data.
In some embodiments, the methylome atlas comprises data from at least 1, 3, 5, 10, 15, 20, or 25 tissues. Each possibility represents a separate embodiment of the invention. In some embodiments, the methylome atlas comprises data from at least 1, 3, 5, 10, 15, 20, or 25 cell types. Each possibility represents a separate embodiment of the invention. In some embodiments, the methylome atlas comprises data from at least 1, 3, 5, 10, 15, 20, or 25 tissues and/or cell types. Each possibility represents a separate embodiment of the invention. It will be understood by one skilled in the art that tissues and/or cell types can be identified as the origin of the cfDNA only if they are included in the atlas. Thus, for example, if only 10 tissues are selected then cfDNA can only be identified as coming from one of those tissues. Similarly, if it is predicted that the cfDNA may be from a particular source it may be less time consuming and cheaper to use a smaller atlas. For example, if hepatotoxicity is being examined, the atlas may comprise only the hepatocyte methylome, or only methylomes from liver cell types. In such a case the readout would be that X % of the cfDNA is from dead hepatocytes/liver cells and the rest is unknown. If the possible source of cfDNA is unknown, or if a subject is healthy (or appears healthy), then a broader atlas comprising CpGs from more tissues would be preferred. In some embodiments, the atlas comprises data from at least 5 cell types and/or tissues.
In some embodiments, the atlas comprises data only from tissues. In some embodiments, the atlas comprises data only from cell types. In some embodiments, the atlas comprises data from tissues and cell types. In some embodiments, the cell types are purified cell populations. In some embodiments, the cell types comprise blood-derived purified cell populations. In some embodiments, the cell types comprise tissue-derived purified cell populations. In some embodiments, the atlas does not comprise data from blood-derived purified cell populations. In some embodiments, the atlas consists of only tissue-derived purified cell population data. In some embodiments, the atlas comprises data from blood-derived and tissue-derived purified cell populations. As used here, “blood-derived” and “tissue-derived” cell types or cell populations refer to a cell type or population whose source is either blood or a tissue or organ. Blood-derived populations are well known, and may be red blood cells, monocytes, B-cells, T-cells, or the like. They may express specific markers, such as CD4-positive or CD8-positive T cells, for non-limiting example. Tissue-derived cells are from a tissue or organ and not blood. All organs are made up of multitudes of cells that may be identified by markers, such as protein expression, surface expression, secretion or morphology. Examples include different neurons in the brain, and beta/duct/acinar cells in the pancreas.
In some embodiments, the methylome atlas comprises methylation data from at least 1, 2, 3, 4, 5, 10, 15, 20, 25, 30 or 34 of the following 34 tissues or cell types: monocytes, B-cells, CD4+ T-cells, NK-cells, CD8+ T-cells, eosinophils, neutrophils, erythrocyte progenitors, adipocytes, neurons, hepatocytes, lung alveolar cells, pancreatic beta cells, pancreatic acinar cells, pancreatic duct cells, vascular endothelial cells, left atrium, bladder, breast, cervix, colon, esophagus, oral cavity, kidney, prostate, rectum, stomach, thyroid, uterus, lung bronchial cells, cholangiocytes, muscle, oligodendrocytes, and ovary. Each possibility represents a separate embodiment of the invention. In some embodiments, the methylome atlas comprises methylation data from at least 1, 2, 3, 4, 5, 10, 15, 20, 25, 29, 30, 31, 32, 33 or 34 of the following 34 tissues or cell types: monocytes, B-cells, CD4+ T-cells, NK-cells, CD8+ T-cells, eosinophils, neutrophils, erythrocyte progenitors, adipocytes, neurons, hepatocytes, lung alveolar cells, pancreatic beta cells, pancreatic acinar cells, pancreatic duct cells, vascular endothelial cells, left atrium, bladder, breast, cervix, colon, esophagus, oral cavity, kidney, prostate, rectum, stomach, thyroid, uterus, lung bronchial cells, cholangiocytes, muscle, oligodendrocytes, and ovary. Each possibility represents a separate embodiment of the invention. 
In some embodiments, the methylome atlas comprises methylation data from at least 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 34 or 35 of the following 35 tissues or cell types: monocytes, B-cells, CD4+ T-cells, NK-cells, CD8+ T-cells, eosinophils, neutrophils, erythrocyte progenitors, adipocytes, neurons, hepatocytes, lung alveolar cells, pancreatic beta cells, pancreatic acinar cells, pancreatic duct cells, vascular endothelial cells, left atrium, bladder, breast, cervix, colon, esophagus, oral cavity, head and neck, kidney, prostate, rectum, stomach, thyroid, uterus, lung bronchial cells, cholangiocytes, muscle, oligodendrocytes, and ovary. Each possibility represents a separate embodiment of the invention. In some embodiments, the methylome atlas comprises methylation data from at least 1, 2, 3, 4, 5, 10, 15, 20, 25, 29, 30, 31, 32, 33, 34 or 35 of the following 35 tissues or cell types: monocytes, B-cells, CD4+ T-cells, NK-cells, CD8+ T-cells, eosinophils, neutrophils, erythrocyte progenitors, adipocytes, neurons, hepatocytes, lung alveolar cells, pancreatic beta cells, pancreatic acinar cells, pancreatic duct cells, vascular endothelial cells, left atrium, bladder, breast, cervix, colon, esophagus, oral cavity, head and neck, kidney, prostate, rectum, stomach, thyroid, uterus, lung bronchial cells, cholangiocytes, muscle, oligodendrocytes, and ovary. Each possibility represents a separate embodiment of the invention. In some embodiments, the monocytes are CD14+ monocytes. In some embodiments, the B-cells are CD19+ B-cells. In some embodiments, the NK-cells are CD56+ NK-cells. In some embodiments, oral cavity cells are head and neck cells.
In some embodiments, the atlas comprises at least the 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 100, 200, 300, 400, 500, 1000, 2000, 3000, 4000 or 5000 most uniquely methylated sites in a tissue or in each tissue. Each possibility represents a separate embodiment of the invention. In some embodiments, the atlas comprises at least 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, or 100 of the 100 most uniquely methylated sites in a tissue or in each tissue. Each possibility represents a separate embodiment of the invention. In some embodiments, the atlas comprises at least the 100 most uniquely methylated sites in a tissue. In some embodiments, the methylome atlas further comprises any CpG sites within at least 50, 100, 150, 200, 250, 300, 400, 500, 1000, 1500, or 2000 base pairs upstream and/or downstream of the most uniquely methylated sites in a tissue or in each tissue. Each possibility represents a separate embodiment of the invention.
In some embodiments, the atlas comprises at least the 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 100, 200, 300, 400, 500, 1000, 2000, 3000, 4000 or 5000 most uniquely unmethylated sites in a tissue or in each tissue. Each possibility represents a separate embodiment of the invention. In some embodiments, the atlas comprises at least 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, or 100 of the 100 most uniquely unmethylated sites in a tissue or in each tissue. Each possibility represents a separate embodiment of the invention. In some embodiments, the atlas comprises at least the 100 most uniquely unmethylated sites in a tissue. In some embodiments, the methylome atlas further comprises any CpG sites within at least 50, 100, 150, 200, 250, 300, 400, 500, 1000, 1500, or 2000 base pairs upstream and/or downstream of the most uniquely unmethylated sites in a tissue or in each tissue. Each possibility represents a separate embodiment of the invention.
Selection of hyper/hypomethylated sites can be performed as follows. To find cell-type-specific CpGs, a matrix X of N CpGs by K cell types is obtained, where each column represents the average methylation of all samples of that cell type at each CpG. A matrix X′ is then created, where each row Xi′ is calculated as
For each cell type k, the rows with the highest value in column k can be identified as uniquely hypermethylated CpGs. To identify uniquely hypomethylated CpGs, a similar process is performed: a matrix X″=1−X is defined, then a matrix
is defined, and again the rows with the highest value in column k are identified as uniquely hypomethylated CpGs. For each cell type, the top hypermethylated CpGs and hypomethylated CpGs can be included in the reference matrix. A more detailed description of an algorithm that may be used for site selection follows.
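The selection of uniquely hypermethylated and hypomethylated CpGs described above can be sketched in Python. This is only an illustrative sketch: the row normalization is not reproduced in the text, so the code assumes that each row is rescaled to sum to one, and the function name `select_unique_cpgs`, the parameter `top_n`, and the small constant `eps` are hypothetical.

```python
import numpy as np

def select_unique_cpgs(X, top_n=100, eps=1e-12):
    """Return {k: (hyper_rows, hypo_rows)} for an N x K atlas X.

    X[i, k] is the average methylation (0..1) of CpG i in cell type k.
    """
    X = np.asarray(X, dtype=float)
    # ASSUMPTION: each row is rescaled to sum to 1, so a large value in
    # column k means the CpG is methylated almost only in cell type k.
    X_hyper = X / (X.sum(axis=1, keepdims=True) + eps)
    # Same procedure on 1 - X to score uniquely unmethylated CpGs.
    X_inv = 1.0 - X
    X_hypo = X_inv / (X_inv.sum(axis=1, keepdims=True) + eps)

    selected = {}
    for k in range(X.shape[1]):
        # Rows with the highest value in column k, best first.
        hyper = np.argsort(-X_hyper[:, k], kind="stable")[:top_n]
        hypo = np.argsort(-X_hypo[:, k], kind="stable")[:top_n]
        selected[k] = (hyper, hypo)
    return selected
```

For example, `select_unique_cpgs(X, top_n=100)[k]` would give the row indices of the 100 highest-scoring hypermethylated and hypomethylated CpGs for cell type k under the assumed normalization.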
Optionally, an efficient algorithm based on Compressed Sensing is used for selecting a small number of CpGs for inference of cell-of-origin for circulating cell-free DNA using DNA methylation patterns.
Given an expression atlas X, composed of n CpGs over d cell types, a matrix A will denote a normalized version of the matrix X. For rows Xi whose average methylation is <0.5 we will set Ai=Xi, and set Ai=1−Xi for all other rows. Next, we will normalize each row Ai to have a Euclidean norm of one: Ai=Ai/∥Ai∥. We are now ready to set up the Compressed Sensing problem. Given a matrix A, we wish to find a column-sparse matrix W such that W×A is approximately the identity matrix, and:
where s is the number of non-zero columns in W.
Once W is found such that W×A≈I, for every vector Y=Aβ that is a linear combination of columns in A, it holds that WY=W(Aβ)=(WA)β=Iβ=β. In other words, multiplying a mixed vector Y by W immediately yields the mixture coefficients β.
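The identity WY=β can be checked numerically. The following is a minimal sketch on synthetic data, not the column-sparse solution sought above: it takes W to be the dense Moore-Penrose pseudo-inverse of A, computed with `numpy.linalg.pinv`, which satisfies W×A=I whenever A has full column rank. The matrix sizes, random seed, and mixture vector are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cpgs, n_types = 50, 4          # arbitrary sizes for the illustration

# Synthetic stand-in for the normalized atlas A (rows: CpGs, columns: cell types).
A = rng.random((n_cpgs, n_types))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # each row has Euclidean norm 1

# Dense pseudo-inverse: a d x n matrix with W @ A = I for full-column-rank A.
W = np.linalg.pinv(A)

beta = np.array([0.7, 0.2, 0.1, 0.0])   # hypothetical mixture coefficients
Y = A @ beta                             # mixed methylation vector
beta_hat = W @ Y                         # recovers beta up to rounding error
```

With noisy measurements the recovery would be approximate rather than exact, which is one motivation for choosing the CpG subset carefully.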
Optionally, a greedy algorithm is used for finding W. First, assume that a set S of the CpGs (|S|=s) is provided. Given such a partial set S, a projection of W to the S columns is denoted by Ws, and a projection of A to the S rows is denoted by As.
Thus, finding the non-zero columns of W, denoted Ws, is equivalent to finding a matrix B which is the pseudo-inverse of the S rows of A, and may be achieved with standard tools (e.g. the pinv function in MATLAB or numpy.linalg.pinv in Python). Next, the set S may be increased by adding the next, i'th, feature that minimizes:
This is equivalent to using the current base Ws, and finding a new orthogonal vector v, such that:
Define the matrix C=I−WsAs and a function ƒ(v)=∥vAi−C∥2. Differentiating ƒ with respect to v gives the optimal value v=CAi/∥Ai∥2 for each i. Next, one may search for the next feature i minimizing ƒ(v). By substituting v in EQN 1 and some manipulations, one obtains:
where the n×d matrix AC^T (the product of A with the transpose of C) may be computed and the row i with the maximal L2 norm chosen.
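The step from ƒ(v) to the row-norm criterion can be worked out explicitly. The derivation below is a sketch under stated conventions, which are assumptions: A_i is treated as a 1×d row vector, v as a d×1 column vector, the norm in ƒ is read as the squared Frobenius norm, and the n×d matrix named above is read as the product of A with the transpose of C.

```latex
\begin{align*}
f(v) &= \lVert v A_i - C \rVert_F^2
      = \lVert v \rVert^2 \lVert A_i \rVert^2 - 2\, v^{\top} C A_i^{\top} + \lVert C \rVert_F^2, \\
\frac{\partial f}{\partial v} &= 2\,(v A_i - C)\,A_i^{\top} = 0
      \;\;\Longrightarrow\;\; v^{*} = \frac{C A_i^{\top}}{\lVert A_i \rVert^2}, \\
f(v^{*}) &= \lVert C \rVert_F^2 - \frac{\lVert C A_i^{\top} \rVert^2}{\lVert A_i \rVert^2}.
\end{align*}
% Since the rows of A are normalized (\lVert A_i \rVert = 1) and \lVert C \rVert_F
% does not depend on i, minimizing f(v^*) over i is equivalent to
\[
i^{*} = \arg\max_i \lVert C A_i^{\top} \rVert_2 = \arg\max_i \lVert (A C^{\top})_{i} \rVert_2 ,
\]
% i.e. choosing the row of A C^{\top} with the largest L2 norm.
```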
By repeating this procedure d times, with d denoting the number of cell types in the atlas, and provided that the rank of A is at least d, an optimal subset of d CpG sites may be found that reveal the mixture coefficients β.
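The greedy procedure can be sketched as follows. This is an illustrative implementation under assumptions, not a definitive one: it starts from an empty set S (the text assumes a set S is already provided), scores each CpG by the L2 norm of its row in the product of A with the transpose of C, and recomputes Ws with `numpy.linalg.pinv` after each addition. The function name `greedy_select` is hypothetical.

```python
import numpy as np

def greedy_select(A, n_select):
    """Greedily pick n_select rows (CpGs) of the row-normalized n x d matrix A.

    Returns the selected row indices and W_S, the pseudo-inverse of A[selected].
    """
    A = np.asarray(A, dtype=float)
    n, d = A.shape
    selected = []
    W_S = np.zeros((d, 0))
    C = np.eye(d)                 # residual I - W_S A_S; equals I while S is empty
    for _ in range(n_select):
        # L2 norm of every row of A C^T: how much each CpG reduces the residual.
        scores = np.linalg.norm(A @ C.T, axis=1)
        if selected:
            scores[selected] = -np.inf    # never pick the same CpG twice
        selected.append(int(np.argmax(scores)))
        A_S = A[selected]                 # s x d sub-matrix of chosen rows
        W_S = np.linalg.pinv(A_S)         # d x s non-zero columns of W
        C = np.eye(d) - W_S @ A_S         # updated residual
    return selected, W_S
```

Calling `greedy_select(A, d)` on an atlas of full column rank yields d CpGs for which `W_S @ A[selected]` is the identity, so `W_S @ Y[selected]` returns the mixture coefficients of a noiseless mixture Y.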
To obtain a more robust solution, this procedure may be repeated J times (in each run j, the set S(j) is selected after excluding all features chosen in previous runs), yielding several sparse matrices W. By averaging over these matrices, W′=E[W], a sparse W′ matrix may be obtained which is defined over a superset of |S|×J CpG features. The multiplication of the W′ matrix with any mixture data Y may robustly estimate the mixture coefficients β.
Due to measurement noise, this approach may return erroneous results from specific samples of very similar cell types. Such cell types may have very similar methylation patterns, with few separating CpGs. To correct the erroneous results, one may choose to focus on a subset of CpG sites using a feature selection approach, that would construct an “Error-correcting code” as follows.
Another set S may explicitly include CpGs that are differentially methylated between overall similar cell types. Given a current set S of CpG sites, one could consider the distance between all pairs of cell types <i,j> when projected onto the current set S of CpGs, and identify the most similar pair <i,j>. Then, one can identify the CpG site <k> that is the most differentially methylated between cell types i and j, and add the CpG <k> to the set S. By repeating this process iteratively, one can identify a large set of CpGs S whose methylation pattern may differentiate between similar tissues. Namely, at each stage, one would identify the most similar pair of cell types/tissues <i,j> given the current set S,
and then find the k'th feature that would further separate them the most:
Pairwise distances may be computed among all cell types and used to separate the current pair of tissues. This procedure may be iteratively applied until all pairs of cell types are differentiated.
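The iterative pair-separating selection can be sketched in Python. The distance measure and the separation score are not reproduced in the text, so the sketch assumes Euclidean distance between cell-type columns restricted to S and the absolute methylation difference as the separation score. The function name `add_separating_cpgs` and the fixed number of additions `n_add` are hypothetical (the text instead iterates until all pairs are differentiated).

```python
import numpy as np
from itertools import combinations

def add_separating_cpgs(A, S, n_add):
    """Extend the CpG index list S with sites that separate similar cell types.

    A is an n x d atlas (rows: CpGs, columns: cell types); S must be non-empty.
    """
    A = np.asarray(A, dtype=float)
    S = list(S)
    for _ in range(n_add):
        A_S = A[S]                       # atlas projected onto the current set S
        # ASSUMPTION: similarity is Euclidean distance between columns of A_S.
        i, j = min(
            combinations(range(A.shape[1]), 2),
            key=lambda p: np.linalg.norm(A_S[:, p[0]] - A_S[:, p[1]]),
        )
        # ASSUMPTION: the best separating CpG maximizes |A[k, i] - A[k, j]|.
        diff = np.abs(A[:, i] - A[:, j])
        diff[S] = -np.inf                # skip CpGs that are already selected
        S.append(int(np.argmax(diff)))
    return S
```

In this form the most similar pair is recomputed after every addition, so successive iterations move on to the next most confusable pair once the current one has been separated.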
A representative example of CpG selection using the Compressed Sensing algorithm can be found in
This approach may be scalable and applied to much larger datasets, covering hundreds or thousands of tissue types across all 30M CpGs. Thus, it may be suitable for analyzing whole-genome bisulfite-seq data, e.g. for cell-of-origin identification. In addition, it may be applied to identify the most informative regions (or CpG blocks) along the genome, and thus used for designing efficient targeted applications (e.g. capture-based). The technique allows one to focus explicitly on pairs of similar cell types that are prone to be confused by other methods, thus identifying a set of key CpGs for accurate and robust analysis of cell-free DNA methylation data.
In some embodiments, the site selection is performed using a latent probabilistic model applied to cell free DNA methylation bisulfite sequencing data.
For example, a latent probabilistic model is used for the analysis of bisulfite sequencing DNA methylation data to infer the cell type and tissue type composition of cell free DNA (tissue of origin) and to quantitatively detect circulating tumor DNA in peripheral blood samples, while incorporating prior medical knowledge.
For example, circulating cell free DNA fragments in the peripheral blood may be analyzed to infer the quantitative admixture of tissues and specific cell types from which the fragments originated, and to detect small fractions of circulating tumor DNA fragments using CpG methylation patterns along the genome. Disclosed herein is a computational model to analyze data from a whole genome, a reduced representation, a capture-based method, or the like, followed by bisulfite sequencing (BS-seq) of DNA fragments originating from the plasma of peripheral blood samples of human patients.
The probabilistic model infers the relative amounts (in genomes/ml units) of tissue-specific and tumor-specific DNA fragments found in plasma samples. It is a unified probabilistic model whose (latent) parameters are composed of (a) the admixture coefficients θ, and (b) the tissue-specific and tumor-specific statistical models of CpG methylation patterns. Bayesian or Maximum Likelihood estimation is applied to infer the latent parameters, thus quantitatively identifying the admixture contributions, and/or to estimate statistical confidence intervals for each admixture coefficient θ, and/or to infer the estimated probability that θ>0 for each healthy or pathological cell type.
For example, consider bisulfite sequencing data D, composed of multiple sequenced reads (or fragments) f. The likelihood of the data D given the model parameters θ can be written as:
$$P(D \mid \theta) = \prod_{f \in D} p(f)$$
where p(f) is the likelihood of observing a read f, and independence between sequenced reads is assumed.
The reads may be sampled from (an unobserved) composition of multiple types t=1 . . . T of methylation patterns, each from a different cell type, tumor-type, or some cellular condition. Using the law of total probability, p(f) can be decomposed as a weighted sum of the conditional probabilities pt(f) for each cell type t:
$$p(f) = \sum_{t=1}^{T} p(t)\, p(f \mid t) = \sum_{t=1}^{T} \theta_t\, p_t(f)$$
where p(t)=θt represents the relative amount of cell-free DNA fragments in the plasma from cell-type or tumor-type t, and p(f|t)=pt(f) denotes the conditional probability of observing the sequenced read f in cell-free DNA that has originated from cell type t.
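The mixture likelihood described above can be sketched in Python. This is a minimal illustration that assumes the per-type read probabilities pt(f) have already been computed; the function name `mixture_log_likelihood` and the list-of-lists input layout are hypothetical choices made for the example.

```python
import math


def mixture_log_likelihood(read_likelihoods, theta):
    """Return log P(D | theta) = sum_f log(sum_t theta[t] * p_t(f)).

    read_likelihoods[f][t] holds p_t(f), the probability of read f under
    cell type t (assumed precomputed); theta[t] is the admixture coefficient.
    """
    if abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("admixture coefficients must sum to 1")
    total = 0.0
    for per_type in read_likelihoods:
        # Law of total probability: p(f) = sum_t theta_t * p_t(f)
        p_f = sum(th * p for th, p in zip(theta, per_type))
        if p_f <= 0.0:
            raise ValueError("read has zero probability under the mixture")
        # Independence between reads turns the product into a sum of logs
        total += math.log(p_f)
    return total
```

Working in log space avoids numerical underflow when the product runs over many reads.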
The parameters of this model, including the admixture parameters θt and the cell type-specific likelihood models pt(f), are unobserved. The following describes how the likelihood is optimized with regard to θ.
The distribution pt(f) of observing a read f from the plasma portion contributed by cell type t is unknown. One can approximate pt(f) using a methylation atlas containing bisulfite sequencing data from purified cell types and tumor biopsies. Such data, from either whole genome bisulfite sequencing or from reduced representation bisulfite sequencing, may be rapidly accumulated and available for multiple cell types and tissues (healthy or pathological). In addition, some distributions pt(f) may be estimated from other types of CpG methylation data, such as Illumina Infinium 450K or EPIC BeadChip platforms, which may be available for multiple cell types. Statistical correlations between adjacent CpGs in the same CpG haplotype blocks may be used for approximating the joint probability of CpG methylation in a genome wide manner.
The probability of multiple adjacent CpGs may be approximated using probabilistic graphical models that decompose the joint probability of multiple CpGs into compact models with few parameters by assuming conditional independencies (e.g. Markovian mathematical properties). For example, each CpG haplotype block i that contains up to a hundred CpGs can be modeled in each cell type t using at least two parameters: βti that denotes the average methylation of CpGs within this block, and τti that denotes the probability of two adjacent CpGs being correlated (i.e., similarly methylated in a DNA molecule). This model may allow decomposing the likelihood of a given observation. For example, the likelihood of a read from the i'th block in the t'th cell type, with a CpG methylation pattern of TCCCTTT (C=methylated CpG, T=unmethylated CpG, with other nucleotides ignored) may be determined with the equation:
$$P(\mathrm{TCCCTTT} \mid \beta_{ti}, \tau_{ti}) = \beta_{ti}^{3}\,(1-\beta_{ti})^{4}\,\tau_{ti}^{4}\,(1-\tau_{ti})^{2}$$
where the read contains three methylated CpGs (each with probability βti), four unmethylated CpGs (each with probability 1−βti), four consecutive pairs of equally methylated CpGs (each with probability τti) and two consecutive pairs with alternating methylation (each with probability 1−τti).
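The product form described above can be evaluated for any read pattern by counting methylated and unmethylated CpGs and equal and alternating adjacent pairs. The following is a minimal sketch; the function name `block_read_likelihood` and the encoding of a read as a string of 'C' and 'T' characters are assumptions made for illustration.

```python
def block_read_likelihood(pattern, beta, tau):
    """Likelihood of a read's CpG pattern within one block for one cell type.

    pattern: string over {'C', 'T'} (C = methylated CpG, T = unmethylated CpG).
    beta:    average methylation of the block (beta_ti in the text).
    tau:     probability that two adjacent CpGs are similarly methylated
             (tau_ti in the text).
    """
    if set(pattern) - {"C", "T"}:
        raise ValueError("pattern may contain only 'C' and 'T'")
    methylated = pattern.count("C")
    unmethylated = pattern.count("T")
    # Adjacent CpG pairs along the read
    pairs = list(zip(pattern, pattern[1:]))
    same = sum(1 for a, b in pairs if a == b)
    alternating = len(pairs) - same
    return (beta ** methylated
            * (1 - beta) ** unmethylated
            * tau ** same
            * (1 - tau) ** alternating)
```

For the pattern TCCCTTT the counts are three methylated CpGs, four unmethylated CpGs, four equal pairs and two alternating pairs, matching the worked example in the text.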
Optionally, given bisulfite sequencing data from cell type t, Maximum Likelihood models are used to infer the values βti and τti for each cell type t and haplotype block i using:
where 1{ } denotes the indicator function, and fj denotes the j'th CpG sequenced in the read f.
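Because the block likelihood is a product of a βti term and a τti term, maximizing it over reference reads reduces to simple counting: the fraction of methylated CpGs, and the fraction of adjacent CpG pairs with equal methylation. The sketch below illustrates this under that assumption; the function name `estimate_block_parameters` and the string encoding of reads are hypothetical.

```python
def estimate_block_parameters(reads):
    """Estimate (beta, tau) for one block and one cell type by counting.

    reads: iterable of strings over {'C', 'T'}
           (C = methylated CpG, T = unmethylated CpG).
    beta is the fraction of methylated CpGs over all reads; tau is the
    fraction of adjacent CpG pairs whose methylation states are equal.
    """
    n_cpg = n_meth = n_pairs = n_same = 0
    for read in reads:
        n_cpg += len(read)
        n_meth += read.count("C")  # indicator 1{f_j is methylated}
        for a, b in zip(read, read[1:]):
            n_pairs += 1
            n_same += (a == b)     # indicator 1{f_j == f_(j+1)}
    if n_cpg == 0 or n_pairs == 0:
        raise ValueError("need at least one read with two or more CpGs")
    return n_meth / n_cpg, n_same / n_pairs
```

In practice a small pseudocount could be added to keep the estimates away from exactly 0 or 1, so that no read receives zero likelihood; that refinement is omitted here.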
Using Bayes' theorem, one can infer for each read f the posterior probability of it originating from the cell type t:
$$P(t \mid f) = \frac{P(t_i)\, p_t(f)}{\sum_{t'=1}^{T} P(t'_i)\, p_{t'}(f)}$$
Here, P(ti) denotes the a priori probability of observing a read from the i'th block originating from the t'th cell type. This may allow one to model a non-uniform coverage of reads circulating in the plasma, in a tissue-specific manner, and more importantly, allows for the integration of prior patient-specific medical data into the computational model.
These inferred posterior probabilities of the cell-of-origin of each observed read may play a role in detecting small amounts of circulating tumor cell free DNA fragments (ctDNA). Similarly, they may be integrated for the identification and quantification of circulating DNA from various sources (e.g. pathological or normal), including the estimation of confidence intervals.
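The per-read posterior can be sketched as a normalized product of prior and likelihood. This is a minimal illustration; the function name `read_posteriors` is hypothetical, and the priors and per-type likelihoods are assumed to be supplied by the caller.

```python
def read_posteriors(likelihoods, priors):
    """Posterior probability of each cell type for a single read.

    likelihoods[t]: p_t(f), probability of the read under cell type t.
    priors[t]:      a priori probability of a read from this block
                    originating from cell type t.
    Returns a list P(t | f) that sums to 1.
    """
    if len(likelihoods) != len(priors):
        raise ValueError("likelihoods and priors must have the same length")
    # Bayes' theorem: posterior is proportional to prior * likelihood
    joint = [pr * lk for pr, lk in zip(priors, likelihoods)]
    norm = sum(joint)
    if norm == 0:
        raise ValueError("read has zero probability under every cell type")
    return [j / norm for j in joint]
```

A read whose posterior is concentrated on a tumor type, even when the prior for that type is small, is the kind of evidence that could flag a low-abundance ctDNA fragment.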
More generally, a similar approach may infer the admixture coefficients p(t) of the entire circulating DNA in the plasma by finding a Maximum Likelihood solution to the deconvolution problem. Optionally, this may be done by maximizing the likelihood of the data D with respect to p(t) using an Expectation Maximization algorithm, which iteratively calculates the expected probability of assigning each read to each originating tissue (E-step) and then computes the Maximum Likelihood estimation (or Bayesian estimation, given some medical prior knowledge) of the contribution of each cell type that contributed DNA to the plasma (M-step).
This model may not be limited to whole genome bisulfite sequencing data and may be applied to reduced representation bisulfite sequencing data, or to capture-based bisulfite sequencing data. Moreover, for scalability and speed, feature selection procedures may be applied prior to applying this model, thus focusing the model on the informative portions of the data and possibly ignoring sequenced reads that originate from other regions of the genome.
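The Expectation Maximization procedure for the admixture coefficients can be sketched as follows. This is a minimal illustration that assumes the per-type read likelihoods pt(f) are fixed and precomputed, uses the plain Maximum Likelihood update (no medical prior), and starts from a uniform θ; the function name `em_admixture` and its parameters are hypothetical.

```python
def em_admixture(read_likelihoods, n_iter=200, tol=1e-10):
    """Estimate admixture coefficients theta by Expectation Maximization.

    read_likelihoods[f][t] holds p_t(f) for read f and cell type t.
    Returns a list theta with one coefficient per cell type, summing to 1.
    """
    n_reads = len(read_likelihoods)
    n_types = len(read_likelihoods[0])
    theta = [1.0 / n_types] * n_types  # uniform starting point (assumption)
    for _ in range(n_iter):
        # E-step: expected assignment of each read to each cell type
        counts = [0.0] * n_types
        for per_type in read_likelihoods:
            joint = [th * p for th, p in zip(theta, per_type)]
            norm = sum(joint)
            if norm == 0:
                raise ValueError("read has zero probability under the mixture")
            for t in range(n_types):
                counts[t] += joint[t] / norm
        # M-step: Maximum Likelihood update is the mean expected assignment
        new_theta = [c / n_reads for c in counts]
        converged = max(abs(a - b) for a, b in zip(theta, new_theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta
```

A Bayesian variant would add prior pseudo-counts to `counts` before normalizing in the M-step.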
In some embodiments, the 100 most uniquely methylated sites are selected from Table 1. In some embodiments, the 100 most uniquely unmethylated sites are selected from Table 2. In some embodiments, the CpG sites within at least 150 base pairs upstream and downstream of the most uniquely methylated sites are selected from Table 1. In some embodiments, the CpG sites within at least 150 base pairs upstream and downstream of the most uniquely unmethylated sites are selected from Table 2.
In some embodiments, the methylome atlas further comprises at least one of the 500 CpG sites that best differentiate between the most similar pairs of tissues and cell types. In some embodiments, the analysis of which are the 500 best CpGs is performed iteratively, such that a new decision of which pair of tissues or cell types is most similar is made after each new CpG is added. In some embodiments, at least 1, 5, 10, 50, 100, 150, 200, 250, 300, 350, 400, 450, or 500 CpG sites that best differentiate are added to the atlas. One skilled in the art will understand that with greater numbers of tissues/cell types in the atlas, and with more similar tissues/cell types most of these informative CpGs may be added.
Iterative selection of CpGs that differentiate between the most similar pairs of tissues/cell types can be performed as follows in order to correct for very similar methylomes. Starting with a set S of CpGs (as previously described), and denoting by Xij the methylation value of CpG i (row) in cell type j (column) of the atlas, the Euclidean distance over S was calculated for every pair of cell types <j,k> as
$$d(j,k) = \sqrt{\sum_{i \in S} (X_{ij} - X_{ik})^{2}}$$
and the most similar pair <j′,k′> was identified. Then, the row with the highest difference in values in columns j′ and k′ can be identified, and added into the set S. In some embodiments, the 500 CpG sites that best differentiate between the most similar pairs of tissues and/or cell types are selected from Table 3. More details of algorithms that may be used for this correction are found herein.
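The iterative selection described above can be sketched in Python. This is a minimal illustration; the function name `add_discriminating_cpgs`, the list-of-rows atlas layout (rows are CpGs, columns are cell types), and the use of absolute difference to rank candidate rows are assumptions made for the example.

```python
import math


def add_discriminating_cpgs(atlas, selected, n_add):
    """Iteratively add CpGs that separate the currently most similar pair.

    atlas[i][j]: methylation value of CpG i in cell type j.
    selected:    initial set S, as a list of row indices.
    n_add:       number of CpGs to add.
    Returns the extended list of row indices.
    """
    selected = list(selected)
    chosen = set(selected)
    n_types = len(atlas[0])
    for _ in range(n_add):
        # Find the pair of cell types with the smallest Euclidean distance over S
        best_pair, best_dist = None, math.inf
        for j in range(n_types):
            for k in range(j + 1, n_types):
                d = math.sqrt(sum((atlas[i][j] - atlas[i][k]) ** 2
                                  for i in selected))
                if d < best_dist:
                    best_pair, best_dist = (j, k), d
        j, k = best_pair
        # Add the unused row with the largest difference between columns j and k
        candidates = [i for i in range(len(atlas)) if i not in chosen]
        if not candidates:
            break
        best_row = max(candidates, key=lambda i: abs(atlas[i][j] - atlas[i][k]))
        selected.append(best_row)
        chosen.add(best_row)
    return selected
```

Because the most similar pair is recomputed after every addition, later picks shift to whichever pair remains hardest to separate.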
In some embodiments the methods of the invention can be used to determine the origin of cfDNA even when the cfDNA from one tissue/cell type is a very small percentage of the whole cfDNA. In some embodiments, the cfDNA of a tissue and/or cell type comprises as little as 0.5%, 1%, 1.5%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9% or 10% of all of the cfDNA. Each possibility represents a separate embodiment of the invention. In some embodiments, the cfDNA of a tissue and/or cell type comprises more than 0.5%, 1%, 1.5%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9% or 10% of all of the cfDNA. Each possibility represents a separate embodiment of the invention. In some embodiments, the cfDNA of a tissue and/or cell type comprises less than 0.5%, 1%, 1.5%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9% or 10% of all of the cfDNA. Each possibility represents a separate embodiment of the invention. In some embodiments, the cfDNA of a tissue and/or cell type comprises between 0.5%-10%, 1%-10%, 1.5%-10%, 2%-10%, 0.5%-9%, 1%-9%, 1.5%-9%, 2%-9%, 0.5%-8%, 1%-8%, 1.5%-8%, 2%-8%, 0.5%-7%, 1%-7%, 1.5%-7%, 2%-7%, 0.5%-6%, 1%-6%, 1.5%-6%, 2%-6%, 0.5%-5%, 1%-5%, 1.5%-5%, 2%-5%, 0.5%-4%, 1%-4%, 1.5%-4%, 2%-4%, 0.5%-3%, 1%-3%, 1.5%-3%, 2%-3%, 0.5%-2%, 1%-2%, 1.5%-2%, or 0.5%-1.5% of all of the cfDNA. Each possibility represents a separate embodiment of the invention.
In some embodiments, the methods of the invention determine cfDNA is from a tissue. In some embodiments, the methods of the invention determine cfDNA is from a cell type. In some embodiments, the methods of the invention determine cfDNA is from a tissue and/or a cell type. In some embodiments, cfDNA may be determined to come from more than one cell type of a tissue. In some embodiments, the presence of cfDNA from a cell type or cell types may be used to determine the cfDNA is from the tissue from which the cell type is derived. It will be understood by one skilled in the art that the specificity of the methylation marks defines the specificity of the result. If a beta-cell mark is elevated, it means beta-cells died. It says nothing about other cell types inside or outside the pancreas. If markers from the whole pancreas are elevated, or markers from multiple pancreatic cell types are elevated, it means pan-pancreatic damage. Thus, if markers of only one cell type are elevated, it means selective damage to that cell type.
In some embodiments, the methods of the invention are for use in detecting a disease state or condition in a subject in need thereof and wherein the cfDNA is from the subject. In some embodiments, the methods of the invention are for diagnosing a disease, and/or condition in a subject in need thereof and wherein the cfDNA is from the subject. In some embodiments, the methods of the invention are for diagnosing an increased risk of a disease or condition. A skilled artisan will recognize that many if not all disease states induce cell death in the tissue or cells in which the disease is manifest. As such, knowledge of the origin of cell death can be a surrogate for the disease. In some embodiments, the disease state or condition is selected from organ transplantation, sepsis, and cancer. In some embodiments, the disease state or condition is selected from organ transplantation, sepsis, cancer, neurodegenerative disease, degenerative disease, infection, inflammatory disease, toxicity, trauma, hypoxia, vascular disease and metabolic stress. In some embodiments, the disease is cancer and the methods of the invention determine the cell or tissue of origin of the cancer.
By another aspect, there is provided a computer program product for determining the cell or tissue of origin of cell free DNA (cfDNA), comprising a non-transitory computer-readable storage medium having program code embodied thereon, the program code executable by at least one hardware processor to:
For example, an atlas of uniquely methylated sites has different scales of differentiation. A first scale may be each tissue type against all other tissue types. A second scale may be one tissue type compared to a few tissue types, such as between 1 and 10 other tissue types. An associated set of neighboring sites may be used on the first or second sets along to better determine similar tissues. For example, there may be k-levels of scales for differentiating between tissue types, such as differentiating between different genetic families or the like. For example, each comparison scale atlas data may be considered a subset of the full comparison atlas with N to N comparisons. For example, a series of rules and comparison subsets may allow differentiating tissues with greater than 95% accuracy, greater than 97.5% accuracy, greater than 98.5% accuracy, greater than 99.5% accuracy, and/or the like.
By another aspect, there is provided a computerized system for determining the cell or tissue of origin of cfDNA, comprising: (i) at least one hardware processor; and (ii) a non-transitory computer-readable storage medium having program code embodied thereon. The program code is executable by the at least one hardware processor to:
In some embodiments, the first set comprises between 25 and 100 most uniquely methylated sites. For example, the first subset is a wide scale search for similar tissue types, but may not differentiate to specific tissue types.
In some embodiments, the first set comprises a plurality of most uniquely methylated sites and wherein the plurality of neighboring methylated sites comprises any CpG sites within between 150 and 500 base pairs upstream and downstream of said most uniquely methylated sites in each tissue or cell type. For example, the neighboring methylated sites are the patterns of methylated sites surrounding the uniquely methylated sites, such as within a window distance from each uniquely methylated site, such as a fixed distance in base pairs, and/or the like.
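Collecting the CpG sites that fall within a fixed window around each marker site can be sketched as follows. This is a minimal illustration for positions on a single chromosome; the function name `neighboring_cpgs`, the default window of 150 base pairs, and the inclusive window boundaries are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def neighboring_cpgs(marker_positions, cpg_positions, window=150):
    """Return sorted CpG positions within `window` bp of any marker site.

    marker_positions: genomic coordinates of the uniquely methylated sites.
    cpg_positions:    genomic coordinates of all CpG sites considered
                      (same chromosome as the markers).
    The window is applied both upstream and downstream, boundaries included.
    """
    cpgs = sorted(cpg_positions)
    found = set()
    for pos in marker_positions:
        # Binary search for the slice of CpGs inside [pos - window, pos + window]
        lo = bisect_left(cpgs, pos - window)
        hi = bisect_right(cpgs, pos + window)
        found.update(cpgs[lo:hi])
    return sorted(found)
```

The marker site itself is returned when it appears in `cpg_positions`, so the result can be used directly as the combined set of marker and neighboring sites.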
In some embodiments, the second set comprises between 100 and 500 most uniquely methylated sites. For example, the second subset is a limited list of comparisons (i.e. pairs, triplets, quadruples, etc.) that would fully differentiate a tissue sample data from similar tissue types (i.e. similar uniquely methylated sites).
In some embodiments, the second set of uniquely methylated sites compares a plurality of specific pairs or triplets of tissue types, such as similar tissue types, or the like.
In some embodiments, the first set of uniquely methylated sites are uniquely methylated as compared to all cell types and tissues of the atlas, and wherein the second set of uniquely methylated sites are uniquely methylated in one cell type or tissue as compared to a second most similar cell type or tissue. This example of multiscale genetic searching may allow quick determination of a sample origin and possible pathologies from a minimal sized atlas. The benefits of a small atlas are easier updates, distribution, and/or the like.
In some embodiments, the comparing comprises using a latent probabilistic model. For example, multiple models may be used to determine the highest probability tissue types.
In some embodiments, the methylome atlas is of at least 5 cell types or tissues, wherein said atlas comprises at least 25 of the 100 most uniquely methylated sites and at least 25 of the 100 most uniquely unmethylated sites in each of said 5 cell types or tissues.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
Embodiments may comprise a computer program that embodies the functions described and illustrated herein, wherein the computer program is implemented in a computer system that comprises instructions stored in a machine-readable medium and a processor that executes the instructions. However, it should be apparent that there could be many different ways of implementing embodiments in computer programming, and the embodiments should not be construed as limited to any one set of computer program instructions. Further, a skilled programmer would be able to write such a computer program to implement one or more of the disclosed embodiments described herein. Therefore, disclosure of a particular set of program code instructions is not considered necessary for an adequate understanding of how to make and use embodiments. Further, those skilled in the art will appreciate that one or more aspects of embodiments described herein may be performed by hardware, software, or a combination thereof, as may be embodied in one or more computing systems. Moreover, any reference to an act being performed by a computer should not be construed as being performed by a single computer as more than one computer may perform the act.
By testing device is meant a combination of components that allows the methylation of a piece of DNA to be determined. In some embodiments, the testing device allows for the high-throughput determination of DNA methylation. The components may include any of those described above with respect to the methods for determining DNA methylation. For example, the components may be bisulfite conversion kits, or Illumina methylation arrays, and so on.
In certain embodiments the system or test kit further comprises a display for the output from the processor.
By another aspect, there is provided a method of constructing a methylome atlas, comprising:
In some embodiments, the methods of the invention further comprise:
In some embodiments, the atlas is constructed only with data from whole tissues. In some embodiments, the atlas is constructed only with data from purified cell populations. In some embodiments, the atlas is constructed from data from both tissues and cell populations. As used here, the term “cell type” refers to a unique cell population. Cell types are generally defined by a marker that identifies the population. This marker can be a genetic marker, protein expression, or morphology, to give a few non-limiting examples. Separating cell populations is well known in the art, and can be performed, for example, with magnetic beads, by gradient separation, or by FACS sorting.
In some embodiments, the atlas is constructed with data from cell types from tissue. In some embodiments, the cell types are purified populations from a tissue or organ. In some embodiments, the atlas comprises at least 2 purified populations from the same tissue. In some embodiments, the atlas is constructed with data from purified populations of blood derived cells and tissue derived cells. In some embodiments, the atlas is constructed only from blood derived cells or only tissue derived cells.
In some embodiments, the DNA methylation data is genome wide data. In some embodiments, the DNA methylation data is from a part of the genome. In some embodiments, the DNA methylation data is at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% or 90% of all CpG sites in the genome. Each possibility represents a separate embodiment of the invention. In some embodiments, the DNA methylation data is from a DNA methylation chip or array. Methylation chips, such as for example the Illumina Infinium Human Methylation 450K Beadchip array and the Infinium Human Methylation EPIC Beadchip array, are well known in the art and may be used to provide DNA methylation data for the methods of the invention. In some embodiments, the DNA methylation data is from at least 100000, 150000, 200000, 250000, 300000, 350000, 400000, 450000, or 500000 genomic loci. Each possibility represents a separate embodiment of the invention.
In some embodiments, the method comprises selecting at least the top 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 100, 200, 300, 400, 500, 1000, 2000, 3000, 4000 or 5000 most uniquely methylated sites in a tissue or in each tissue. Each possibility represents a separate embodiment of the invention. In some embodiments, the method comprises selecting at least 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, or 100 of the top 100 most uniquely methylated sites in a tissue or in each tissue. Each possibility represents a separate embodiment of the invention. In some embodiments, the method comprises selecting at least the top 100 most uniquely methylated sites in a tissue. In some embodiments, the method further comprises selecting any CpG sites within at least 50, 100, 150, 200, 250, 300, 400, 500, 1000, 1500, or 2000 base pairs upstream and/or downstream of the most uniquely methylated sites in a tissue or in each tissue. Each possibility represents a separate embodiment of the invention.
In some embodiments, the method comprises selecting at least the top 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 100, 200, 300, 400, 500, 1000, 2000, 3000, 4000 or 5000 most uniquely unmethylated sites in a tissue or in each tissue. Each possibility represents a separate embodiment of the invention. In some embodiments, the method comprises selecting at least 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, or 100 of the top 100 most uniquely unmethylated sites in a tissue or in each tissue. Each possibility represents a separate embodiment of the invention. In some embodiments, the method comprises selecting at least the 100 most uniquely unmethylated sites in a tissue. In some embodiments, the method further comprises selecting any CpG sites within at least 50, 100, 150, 200, 250, 300, 400, 500, 1000, 1500, or 2000 base pairs upstream and/or downstream of the most uniquely unmethylated sites in a tissue or in each tissue. Each possibility represents a separate embodiment of the invention.
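Selecting the top K uniquely methylated or uniquely unmethylated sites for a tissue can be sketched as a ranking problem. The scoring rule below is an assumption made purely for illustration and is not taken from the text: a site is scored by the gap between its methylation in the target cell type and the closest value among all other cell types. The function name `top_unique_sites` and the list-of-rows atlas layout are likewise hypothetical.

```python
def top_unique_sites(atlas, target, k, methylated=True):
    """Rank CpG sites by how uniquely (un)methylated they are in one cell type.

    atlas[i][j]: methylation value (0..1) of CpG i in cell type j.
    target:      column index of the cell type of interest.
    k:           number of sites to return.
    methylated:  True for uniquely methylated sites, False for uniquely
                 unmethylated sites.
    Scoring rule (an assumption): gap between the target value and the
    nearest value among the other cell types.
    """
    scores = []
    for i, row in enumerate(atlas):
        others = [v for j, v in enumerate(row) if j != target]
        if methylated:
            score = row[target] - max(others)
        else:
            score = min(others) - row[target]
        scores.append((score, i))
    # Highest score first; ties broken by row index for determinism
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:k]]
```

Calling the function once with `methylated=True` and once with `methylated=False` for each cell type would yield the two per-tissue site lists; flanking CpGs could then be added around each selected site.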
In some embodiments, the method further comprises selecting at least the top 5, 10, 20, 30, 40, 50, 75, 100, 150, 200, 250, 300, 350, 400, 450, or 500 CpG sites that best differentiate between the most similar pairs of tissues and/or cell types. Each possibility represents a separate embodiment of the invention. In some embodiments, the analysis of which are the best CpGs is performed iteratively, such that a new decision of which pair of tissues or cell types is most similar is made after each new CpG is added. One skilled in the art will understand that with greater numbers of tissues/cell types in the atlas, and with more similar tissues/cell types most of these informative CpGs may be added.
In some embodiments, the DNA methylation data has been preprocessed to remove unreliable CpG sites. In some embodiments, an unreliable site is a site with less than 3 beads. In some embodiments, an unreliable site has a P-value representing the total fluorescence of the relevant probes that is below 0.01 or 0.05. In some embodiments, an unreliable site has a median absolute error of below 0.05.
As used herein, the term “about” when combined with a value refers to plus and minus 10% of the reference value. For example, a length of about 1000 nanometers (nm) refers to a length of 1000 nm ± 100 nm.
It is noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a polynucleotide” includes a plurality of such polynucleotides and reference to “the polypeptide” includes reference to one or more polypeptides and equivalents thereof known to those skilled in the art, and so forth. It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely,” “only” and the like in connection with the recitation of claim elements, or use of a “negative” limitation.
In those instances where a convention analogous to “at least one of A, B, and C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, and C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). It will be further understood by those within the art that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination. All combinations of the embodiments pertaining to the invention are specifically embraced by the present invention and are disclosed herein just as if each and every combination was individually and explicitly disclosed. In addition, all sub-combinations of the various embodiments and elements thereof are also specifically embraced by the present invention and are disclosed herein just as if each and every such sub-combination was individually and explicitly disclosed herein.
Additional objects, advantages, and novel features of the present invention will become apparent to one ordinarily skilled in the art upon examination of the following examples, which are not intended to be limiting. Additionally, each of the various embodiments and aspects of the present invention as delineated hereinabove and as claimed in the claims section below finds experimental support in the following examples.
Generally, the nomenclature used herein, and the laboratory procedures utilized in the present invention include molecular, biochemical, microbiological and recombinant DNA techniques. Such techniques are thoroughly explained in the literature. See, for example, “Molecular Cloning: A laboratory Manual” Sambrook et al., (1989); “Current Protocols in Molecular Biology” Volumes I-III Ausubel, R. M., ed. (1994); Ausubel et al., “Current Protocols in Molecular Biology”, John Wiley and Sons, Baltimore, Md. (1989); Perbal, “A Practical Guide to Molecular Cloning”, John Wiley & Sons, New York (1988); Watson et al., “Recombinant DNA”, Scientific American Books, New York; Birren et al. (eds) “Genome Analysis: A Laboratory Manual Series”, Vols. 1-4, Cold Spring Harbor Laboratory Press, New York (1998); methodologies as set forth in U.S. Pat. Nos. 4,666,828; 4,683,202; 4,801,531; 5,192,659 and 5,272,057; “Cell Biology: A Laboratory Handbook”, Volumes I-III Cellis, J. E., ed. (1994); “Culture of Animal Cells—A Manual of Basic Technique” by Freshney, Wiley-Liss, N. Y. (1994), Third Edition; “Current Protocols in Immunology” Volumes I-III Coligan J. E., ed. (1994); Stites et al. (eds), “Basic and Clinical Immunology” (8th Edition), Appleton & Lange, Norwalk, Conn. (1994); Mishell and Shiigi (eds), “Strategies for Protein Purification and Characterization—A Laboratory Course Manual” CSHL Press (1996); all of which are incorporated by reference. Other general references are provided throughout this document.
All DNA methylation profiles were generated on either the Illumina Infinium Human Methylation 450K Beadchip array or the Infinium Human Methylation EPIC Beadchip array. DNA methylation data for white blood cells (neutrophils, monocytes, eosinophils, B-cells, CD4+ T-cells, CD8+ T-cells, NK-cells, n=6 each) were downloaded from GSE35069 (450K). Data for erythrocyte progenitors (n=5) were downloaded from GSE63409 (450K), and data for left atrium (n=4) were downloaded from GSE62727 (450K). Data for bladder (n=19), breast (n=98), cervix (n=3), colon (n=38), esophagus (n=16), head and neck (n=34), kidney (n=160), prostate (n=50), rectum (n=7), stomach (n=2), thyroid (n=56) and uterus (n=34) were downloaded from TCGA (450K). DNA methylation data for adipocytes (n=3, 450K), hepatocytes (n=2, 450K and EPIC), alveolar lung cells (n=2, EPIC), neurons (n=1, 450K), vascular endothelial cells (n=2, EPIC) and pancreatic acinar cells (n=3, 450K (n=1) and EPIC (n=2)), duct cells (n=3, 450K (n=1) and EPIC (n=2)), beta cells (n=5, 450K (n=3) and EPIC (n=2)) were generated from samples. After construction of the atlas used in the experiment described hereinbelow, data were also compiled for lung bronchial cells, cholangiocytes, muscle, oligodendrocytes and ovary samples.
Cancer-free primary human tissue was obtained from consenting donors, dissociated to single cells, sorted using cell type-specific antibodies, and lysed to obtain genomic DNA, from which 250 ng were applied to an Illumina EPIC methylation array. Adipocytes, cortical neurons, hepatocytes, pancreatic acinar cells, pancreatic beta cells and duct cells were obtained from cadaveric donors, as was distal lung tissue. Alveolar epithelial cells were isolated from the lung by FACS using an antibody for EpCAM. Vascular endothelial cells were isolated using anti-CD31 magnetic beads from the saphenous vein, following surgical excision.
Donors were consented and whole blood (usually 20 ml) was drawn, collected into an EDTA tube, and spun quickly to separate plasma, which was stored at −20° C. until isolation of cfDNA.
250 ng was collected from each sample, except where otherwise specified. DNA concentration was measured with Qubit. cfDNA extraction from plasma was performed with the QIAsymphony liquid handling robot.
Methylation array data were processed with the minfi package in R. For each sample analyzed on the Illumina Methylation array, CpG sites were filtered out if they were represented by fewer than 3 beads on the array, if the detection P-value, representing total fluorescence of the relevant probes, was greater than 0.01, or if they mapped to a sex chromosome. Background correction and normalization were performed with the preprocessIllumina function, which removes background calculated based on internal control probes and normalizes all samples to a predetermined control sample.
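The per-sample probe filter described above can be sketched as follows. This is an illustrative Python sketch only and not the minfi workflow itself; the function name `probe_keep_mask`, its parameters, and the chromosome labels are hypothetical, and the sketch assumes that a high detection P-value marks an unreliable probe.

```python
import numpy as np

def probe_keep_mask(bead_counts, detection_p, chromosomes,
                    min_beads=3, max_p=0.01):
    """Return a boolean mask that is True for CpG probes to keep.

    bead_counts : number of beads per probe for one sample
    detection_p : detection P-value per probe for the same sample
    chromosomes : chromosome label per probe (e.g. "chr1", "chrX")
    """
    bead_counts = np.asarray(bead_counts)
    detection_p = np.asarray(detection_p, dtype=float)
    chromosomes = np.asarray(chromosomes, dtype=str)

    # Assumed labelling of sex chromosomes; adapt to the annotation in use.
    on_sex_chrom = np.isin(chromosomes, ["chrX", "chrY", "X", "Y"])

    enough_beads = bead_counts >= min_beads   # drop probes with < 3 beads
    detected = detection_p <= max_p           # drop poorly detected probes
    return enough_beads & detected & ~on_sex_chrom

# Toy data: four probes, only the first passes all three criteria.
toy_mask = probe_keep_mask(
    bead_counts=[5, 2, 8, 10],
    detection_p=[0.001, 0.001, 0.2, 0.0005],
    chromosomes=["chr1", "chr2", "chr3", "chrX"],
)
```

In this toy input the second probe fails on bead count, the third on detection P-value, and the fourth because it maps to a sex chromosome.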
As the reference database included samples analyzed with two highly similar yet not identical platforms, the Illumina 450K array and the Illumina EPIC array, sites with low reproducibility between the platforms were identified and removed. To this end, data from samples analyzed on both platforms were collected: 15 samples from GSE86833, 12 samples from GSE92580, and 1 sample from the dataset generated herein (hepatocytes). For each overlapping CpG, the median absolute error (MAE) between the 450K samples and the corresponding EPIC samples was calculated, and 37747 CpGs with a MAE>0.05 were removed.
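A minimal sketch of such a cross-platform reproducibility filter is given here for illustration. It assumes that the paired measurements are arranged as two matrices with matching rows (CpGs) and columns (samples), and that the sites discarded are those whose per-site error exceeds the threshold; the function name `reproducible_cpg_mask` is hypothetical.

```python
import numpy as np

def reproducible_cpg_mask(beta_450k, beta_epic, max_error=0.05):
    """Return True for CpGs that agree between the two platforms.

    beta_450k, beta_epic : arrays of shape (n_cpgs, n_paired_samples);
    column j holds the same biological sample measured on each platform.
    """
    beta_450k = np.asarray(beta_450k, dtype=float)
    beta_epic = np.asarray(beta_epic, dtype=float)

    # Absolute disagreement per CpG and per paired sample.
    abs_err = np.abs(beta_450k - beta_epic)

    # Median absolute error per CpG across the paired samples.
    per_site_error = np.median(abs_err, axis=1)
    return per_site_error <= max_error

# Toy data: the middle CpG disagrees strongly between platforms.
toy_450k = np.array([[0.10, 0.12, 0.11],
                     [0.80, 0.20, 0.50],
                     [0.95, 0.93, 0.96]])
toy_epic = np.array([[0.11, 0.10, 0.12],
                     [0.60, 0.45, 0.30],
                     [0.94, 0.95, 0.97]])
toy_keep = reproducible_cpg_mask(toy_450k, toy_epic)
```

Only CpGs for which the mask is True would be carried forward into the reference atlas.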
To find cell type-specific CpGs, a matrix X of N CpGs by K cell types was first obtained, in which each column represents the average methylation of all samples of that cell type at each CpG. A matrix X′ was then created in which each row X′_i was calculated as
$X'_{i} = X_{i} / \sum_{k=1}^{K} X_{i,k}$
For each cell type k, the 5000 rows with the highest value in column k of X′ were identified as uniquely hypermethylated CpGs. To identify uniquely hypomethylated CpGs, a similar process was performed: a matrix X″=1−X was defined, then a matrix X‴ with rows
$X'''_{i} = X''_{i} / \sum_{k=1}^{K} X''_{i,k}$
was defined, and again the 5000 rows with the highest value in column k were identified as uniquely hypomethylated CpGs. For each cell type, the top 5000 hypermethylated CpGs and the top 5000 hypomethylated CpGs were included in the reference matrix.
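The selection of uniquely hyper- and hypomethylated CpGs can be illustrated with the following Python sketch. It assumes that each row of the atlas matrix is normalized by its sum across cell types, so that a high value in column k means the methylation at that CpG is concentrated in cell type k; this normalization, the small constant `eps`, and the function name `top_specific_cpgs` are assumptions made for illustration.

```python
import numpy as np

def top_specific_cpgs(X, n_top=5000):
    """Rank CpGs by how specific their (un)methylation is to each cell type.

    X : array of shape (n_cpgs, n_cell_types) of mean beta values in [0, 1].
    Returns two dicts mapping cell type index -> row indices of the n_top
    uniquely hypermethylated / hypomethylated CpGs.
    """
    X = np.asarray(X, dtype=float)
    eps = 1e-12  # guards against division by zero for all-zero rows

    # Share of each row's total methylation carried by each cell type.
    hyper_score = X / (X.sum(axis=1, keepdims=True) + eps)

    # Same idea applied to the unmethylated fraction, 1 - X.
    X_unmeth = 1.0 - X
    hypo_score = X_unmeth / (X_unmeth.sum(axis=1, keepdims=True) + eps)

    n_top = min(n_top, X.shape[0])
    hyper, hypo = {}, {}
    for k in range(X.shape[1]):
        hyper[k] = np.argsort(-hyper_score[:, k], kind="stable")[:n_top]
        hypo[k] = np.argsort(-hypo_score[:, k], kind="stable")[:n_top]
    return hyper, hypo

# Toy atlas: 4 CpGs by 3 cell types.
toy_atlas = np.array([[0.9, 0.1, 0.1],   # methylated only in type 0
                      [0.1, 0.9, 0.9],   # unmethylated only in type 0
                      [0.5, 0.5, 0.5],   # uninformative
                      [0.1, 0.1, 0.9]])  # methylated only in type 2
toy_hyper, toy_hypo = top_specific_cpgs(toy_atlas, n_top=1)
```

With this toy atlas the first CpG is returned as the hypermethylated marker of cell type 0 and the second as its hypomethylated marker, while the uninformative third CpG is never selected.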
To calculate the relative contribution of each cell type to a given sample, non-negative least squares (NNLS) regression was performed, as implemented in the nnls package in R. Given a matrix X of reference methylation values with N CpGs and K cell types, and a vector y of methylation values of length N, non-negative coefficients β were identified by solving $\arg\min_{\beta} \lVert X\beta - y \rVert_2$, subject to β≥0. The resulting β was adjusted to have a sum of 1, where for each β_k it was defined
$\beta'_k = \beta_k / \sum_{j=1}^{K} \beta_j$
To obtain absolute levels of cfDNA (genome equivalents/ml) per cell type, the resulting β′_k was multiplied by the total concentration of cfDNA present in the sample, as measured by Qubit.
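For illustration, the deconvolution step can be sketched in Python. The analysis described herein used the nnls package in R; the sketch below substitutes `scipy.optimize.nnls`, which solves the same constrained least squares problem. The function name `deconvolve` and the toy reference values are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

def deconvolve(reference, profile, total_cfdna=None):
    """Estimate cell type contributions to a methylation profile.

    reference   : (n_cpgs, n_cell_types) matrix of reference beta values
    profile     : length n_cpgs vector of beta values for the sample
    total_cfdna : optional total cfDNA concentration of the sample
    Returns (fractions summing to 1, absolute levels or None).
    """
    reference = np.asarray(reference, dtype=float)
    profile = np.asarray(profile, dtype=float)

    # Solve argmin ||reference @ beta - profile||_2 subject to beta >= 0.
    coef, _residual_norm = nnls(reference, profile)

    total = coef.sum()
    if total == 0:
        raise ValueError("all coefficients are zero; cannot normalize")

    # Rescale so that the contributions sum to 1.
    fractions = coef / total

    absolute = None if total_cfdna is None else fractions * total_cfdna
    return fractions, absolute

# Toy example: a sample that is 75% cell type A and 25% cell type B.
toy_reference = np.array([[0.9, 0.1],
                          [0.1, 0.9],
                          [0.8, 0.3],
                          [0.2, 0.7]])
toy_profile = 0.75 * toy_reference[:, 0] + 0.25 * toy_reference[:, 1]
toy_fractions, toy_absolute = deconvolve(toy_reference, toy_profile,
                                         total_cfdna=1000.0)
```

Because the toy profile is an exact linear mixture of the two reference columns, the recovered fractions match the mixing proportions; real cfDNA profiles contain noise and contributions from cell types absent from the reference, so the fit is only approximate.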
Four leukocyte samples were profiled with Illumina methylation arrays (3 with 450K, 1 with EPIC). For each cell type, each available sample was mixed with each leukocyte sample in ratios 0%:100%, 0.1%:99.9%, 0.2%:99.8%, . . . , 100%:0%. Then for each cell type, at each ratio, the average predicted percentage was calculated as well as the 90% confidence interval.
Pooled DNA samples were obtained by mixing DNA from several individuals. DNA was extracted from 8 ml of plasma, and samples were added until 250 ng was reached (5-10 samples per pool). No individual contributed more than 2 times as much DNA to a pool as any other individual.
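The in silico mixing scheme can be sketched as follows. The sketch assumes that a mixture is a weighted average of the two methylation profiles and that the 90% interval is taken as the 5th to 95th percentile of the predictions across sample pairs; both are illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np

def mixing_fractions(step=0.001):
    """Fractions 0, 0.001, ..., 1 of the cell type of interest."""
    n_steps = int(round(1.0 / step))
    return np.arange(n_steps + 1) / n_steps

def mix_profiles(cell_profile, leukocyte_profile, fraction):
    """Linear mixture of a cell type profile into a leukocyte profile."""
    cell_profile = np.asarray(cell_profile, dtype=float)
    leukocyte_profile = np.asarray(leukocyte_profile, dtype=float)
    return fraction * cell_profile + (1.0 - fraction) * leukocyte_profile

def summarize_predictions(predicted):
    """Mean and 5th/95th percentiles over rows (one row per sample pair).

    predicted : array of shape (n_pairs, n_ratios) of predicted fractions.
    """
    predicted = np.asarray(predicted, dtype=float)
    mean = predicted.mean(axis=0)
    lower, upper = np.percentile(predicted, [5, 95], axis=0)
    return mean, lower, upper

toy_fraction_grid = mixing_fractions()
toy_mix = mix_profiles([1.0, 0.0], [0.0, 1.0], 0.25)
toy_mean, toy_lower, toy_upper = summarize_predictions([[0.1, 0.2],
                                                        [0.3, 0.4]])
```

Each mixed profile would then be passed to the deconvolution routine, and the predicted fraction of the spiked-in cell type compared with the known mixing ratio.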
Herein the ability to identify the cellular contributors to cfDNA in healthy and pathological conditions is evaluated in an unbiased manner, using an extensive reference atlas of human cell- and tissue-specific methylomes. The Illumina Infinium Methylation Array was used, which allows for the simultaneous analysis of the methylation status of >450,000 CpG sites throughout the human genome. Illumina methylation arrays have already been used in the deconvolution of whole blood methylation profiles to determine the relative proportions of white blood cells in a sample, a crucial step in Epigenome-Wide Association Studies (EWAS). However, to date, this deconvolution approach has been applied only to whole blood samples, where all contributing cells are well-known types of white blood cells. By contrast, the plasma methylome presents major challenges. First, the Illumina arrays require 250-500 ng DNA, which in healthy individuals can be obtained only from 100-200 ml blood, far more than the standard volume drawn for blood tests. Second, neither the cellular sources of cfDNA nor their relative contributions are known, complicating the computational problem of accurate deconvolution. It was hypothesized that by generating a comprehensive database of methylation profiles of human tissues and cell types, it would be possible to deconvolute the methylation profiles of plasma-derived cfDNA, and hence to infer the cellular contributions to cfDNA from a wide range of cell types. The major tissues contributing to cfDNA in healthy individuals were determined, as well as those contributing in several pathologies known to involve an increase in circulating cfDNA: organ transplantation, sepsis and cancer.
To obtain a comprehensive DNA methylation database of human cell types, datasets which were previously published, either as part of The Cancer Genome Atlas (TCGA) (Weisenberger, 2014) or by individual groups that deposited data in the Gene Expression Omnibus (GEO) were employed. In selecting datasets to be included in the database, the following criteria were used: 1) use primary tissue sources, which have not been passaged in culture (reasoning that culture may change methylation patterns or alter the cellular composition of a mixed tissue, e.g. enrich for fibroblasts); 2) use the methylomes of healthy human tissues, which are expected to be universally conserved; and 3) whenever possible use methylomes of cell types, rather than whole tissues. It was reasoned that since all tissues are a composite of multiple cell types (e.g. different types of epithelial cells, blood, vasculature and fibroblasts), methylation signatures of minority populations might be difficult to identify, and unique signatures of the tissue might be masked by the methylome of stroma.
Unfortunately, other than isolated blood cell types, the vast majority of publicly available methylomes come from bulk tissue. Therefore, methylation profiles of key human cell types which had not been previously published were generated, and these were included in the database. Primary human adipocytes, cortical neurons, hepatocytes, lung alveolar cells, pancreatic beta cells, pancreatic acinar cells, pancreatic duct cells, and vascular endothelial cells were isolated. As detailed in the Materials and Methods hereinabove, surgical samples from each tissue were enzymatically dissociated according to published protocols, stained with antibodies against the cell type of interest, and sorted using either flow cytometry (FACS) or magnetic beads (MACS). DNA from sorted cells was then prepared, and the methylome obtained using the Illumina 450K or EPIC array platforms. The result of this effort was a human methylome reference atlas, composed of 29 tissues or cell types (
To deconvolute cfDNA methylation profiles, a subset of CpG sites in the genome which are differentially methylated among the cell types and tissues in the atlas was selected. Only a subset of the methylome was used for deconvolution, based on several considerations. Most CpG sites represented in the Illumina arrays have similar methylation patterns across all tissues; as a result, the methylation score of such sites is not informative, and may add noise to the deconvolution algorithm.
For each tissue, the 5000 CpG sites that had the highest methylation score in that tissue compared with all other tissues were selected, as were the 5000 CpG sites that were most hypomethylated in that tissue (see Materials and Methods). From those possible 10,000 sites, the 100 most hypermethylated and the 100 most hypomethylated for each tissue (Table 1) were selected for deconvolution analysis. To these 200 sites from each tissue, any CpG within 150 base-pairs upstream or downstream of a selected CpG was added (Table 2). As this was done on a per-CpG basis, the number of sites in each tissue varied slightly. Lastly, 500 additional CpGs (Table 3) were added. These CpGs were selected iteratively: at each iteration, the most differentially methylated CpG site between the most similar pair of cell types was chosen.
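The iterative selection of additional CpGs can be sketched as below. The text does not specify how similarity between cell types is measured; the sketch assumes Euclidean distance over the CpGs selected so far and takes the "most differentially methylated" site to be the one with the largest absolute difference in methylation between the two closest cell types. The function name `add_pairwise_cpgs` is hypothetical.

```python
import numpy as np

def add_pairwise_cpgs(X, selected, n_extra=500):
    """Greedily add CpGs that separate the two most similar cell types.

    X        : (n_cpgs, n_cell_types) matrix of mean beta values
    selected : row indices of CpGs already in the reference set
    Returns the list of selected row indices after up to n_extra additions.
    """
    X = np.asarray(X, dtype=float)
    chosen = list(dict.fromkeys(int(i) for i in selected))  # de-duplicate
    available = np.ones(X.shape[0], dtype=bool)
    available[np.asarray(chosen, dtype=int)] = False
    n_types = X.shape[1]

    for _ in range(n_extra):
        if not available.any():
            break
        # Pairwise Euclidean distance between cell types on chosen CpGs.
        sub = X[np.asarray(chosen, dtype=int)]
        diff = sub[:, :, None] - sub[:, None, :]
        dist = np.sqrt((diff ** 2).sum(axis=0))
        dist[np.diag_indices(n_types)] = np.inf  # ignore self-comparisons
        a, b = np.unravel_index(np.argmin(dist), dist.shape)

        # Unused CpG with the largest methylation gap between that pair.
        gap = np.abs(X[:, a] - X[:, b])
        gap[~available] = -np.inf
        best = int(np.argmax(gap))
        chosen.append(best)
        available[best] = False
    return chosen

# Toy atlas: cell types 1 and 2 are indistinguishable on CpG 0 alone.
toy_X = np.array([[0.9, 0.1, 0.1],
                  [0.1, 0.9, 0.9],
                  [0.5, 0.5, 0.1],
                  [0.2, 0.3, 0.8]])
toy_selection = add_pairwise_cpgs(toy_X, selected=[0], n_extra=2)
```

In the toy atlas, the two rounds add CpGs 3 and 2 in that order, since these are the sites that best distinguish cell types 1 and 2.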
Notably, other feature selection approaches are also possible; for example, it has been reported that others have relied mostly on a subset of methylation sites that are most variable among tissues; however, it was found that tissue-specific sites are superior to tissue-variable sites in deconvolution performance (
The Tissues/Cell Types are listed as follows: 1=Monocytes, 2=B-cells, 3=CD4+ T-cells, 4=NK-cells, 5=CD8+ T-cells, 6=Eosinophils, 7=Neutrophils, 8=Erythrocyte Progenitors, 9=Adipocytes, 10=Neurons, 11=Hepatocytes, 12=Lung alveolar cells, 13=Pancreatic Beta cells, 14=Pancreatic acinar cells, 15=Pancreatic duct cells, 16=Vascular endothelial cells, 17=Left atrium, 18=Bladder, 19=Breast, 20=Cervix, 21=Colon, 22=Esophagus, 23=Oral cavity, 24=Kidney, 25=Prostate, 26=Rectum, 27=Stomach, 28=Thyroid, 29=Uterus, 30=Lung bronchial cells, 31=Cholangiocytes, 32=Muscle, 33=Oligodendrocytes, and 34=Ovary.
MU=Most uniquely hyper/hypo-methylated, N=Neighboring
For deconvolution, it was assumed that a cfDNA methylation profile is a linear combination of the methylation profiles of the cell types which contribute to cfDNA. Under this assumption, the relative contributions of different cell types can be determined using non-negative least squares regression (NNLS) (see illustration of the process in
In silico experiments were performed initially to assess performance of the deconvolution approach in determining the relative contributions of various cell types to a methylation profile of DNA from a heterogeneous mixture of cell types. DNA methylation profiles of individual samples of cell types and tissues were computationally mixed with DNA methylation profiles of leukocytes (
As expected, purified cell types mixed into blood methylomes were more easily detected than whole tissues, which represent heterogeneous mixtures of different cell types (
To determine the main contributors to cfDNA in healthy individuals, plasma was collected from multiple healthy donors, who were classified as male or female and as young (age 19-30) or elderly (age 75+). cfDNA from plasma was pooled within these groups to reach 250 ng cfDNA in each pool. Additionally, in a few cases blood draws of ˜100 ml from a single donor were possible, which avoided the need for pooling. Methylation profiles of the samples were then obtained using the Illumina EPIC arrays and deconvolution analysis was performed. The predicted distribution of tissue sources was very similar among the different pools and between individual healthy donors (
As the analysis of cfDNA is emerging as an important tool in the monitoring of organ transplant recipients, the potential for a cfDNA methylome deconvolution approach to identify DNA from a transplanted organ in circulation was evaluated. The methylome profile generated from pooled cfDNA of type 1 diabetes patients 1 hour after receiving a cadaveric islet transplant was determined. The total concentration of cfDNA in these samples was ˜20× higher than healthy control levels, suggesting a massive process of cell death shortly after islet transplantation. The deconvolution algorithm identified that a large proportion of the DNA (˜17%) was of pancreatic origin, both beta cells and acinar cells (
Additional control experiments were performed to assess assay validity. First, the predictions of the deconvolution algorithm for pancreatic cfDNA in the plasma of patients were assessed before (<1 day), 1 hour after, and 2 hours after transplantation. As expected, negligible levels of pancreatic cfDNA were found before islet transplantation, followed by a large increase immediately after transplantation and a subsequent decrease (
Finally, the performance of the deconvolution algorithm was tested when using a reference matrix containing either a whole pancreas methylome or pancreatic cell specific methylomes (acinar, duct and beta cells). Consistent with results from deconvolution of in silico mixes (
An increase in total cfDNA levels in septic patients has been previously documented, and has even been shown to have prognostic value. However, it is unclear which cell types contribute to the elevated cfDNA. Therefore, the cfDNA methylation profiles of 15 samples from patients with sepsis were analyzed. The main contributors to the increase in cfDNA in these patients relative to healthy levels were leukocytes (mainly neutrophils) in most cases (12/15) (
cfDNA methylation profiles of three patients with metastatic colon cancer were analyzed, all of whom presented with elevated overall levels of cfDNA compared to healthy individuals. In these cases, most of the increase in cfDNA could be defined as gastrointestinal in origin (
To assess the reproducibility of the deconvolution analysis, three cfDNA samples were assayed in duplicate (
Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.
This application claims the benefit of priority of U.S. Provisional Patent Application No. 62/631,791, filed Feb. 18, 2018, and U.S. Provisional Patent Application No. 62/661,179, filed Apr. 23, 2018, the contents of which are all incorporated herein by reference in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IL2019/050196 | 2/18/2019 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62631791 | Feb 2018 | US | |
62661179 | Apr 2018 | US |