The present disclosure relates broadly to a method of estimating a disease burden, such as a circulating tumor DNA (ctDNA) burden, and related kits and methods.
Cell-free DNA (cfDNA) is present in the blood circulation of humans. In healthy individuals, the death of normal cells of the hematopoietic lineage is the main contributor of plasma cfDNA. In cancer patients, blood plasma can carry circulating tumor DNA (ctDNA) fragments originating from tumor cells, offering non-invasive access to somatic genetic alterations in tumors. The ctDNA profile of a cancer patient is clinically informative in at least two major ways. Firstly, the profile can provide information about specific actionable mutations that can guide therapy. Secondly, the profile can be used to infer tumor growth dynamics by estimating the amount of ctDNA in the blood. This latter information offers a promising non-invasive approach to track disease progression during clinical trials or therapy, offering a real-time tool to adjust therapy.
Existing next-generation sequencing-based approaches to estimate ctDNA levels in plasma samples are based on somatic single nucleotide variant allele frequencies (SNV VAFs), copy number aberrations (CNAs), or DNA methylation patterns. However, these approaches each have limitations.
Approaches based on somatic variant allele frequencies only work for patients that have known recurring cancer mutations, and reliable estimation requires that multiple mutations are present in the ctDNA. Since cfDNA targeted sequencing typically only covers a few hundred selected cancer genes because of the need for ultra-deep sequencing (~10,000×), most patients will not have a sufficient number of detectable mutations to allow reliable tumor content estimation. ctDNA burden estimation based on SNVs may therefore be challenging when no clonal mutations exist among the targeted genes.
Alternatively, low-pass whole genome sequencing (lp-WGS) yields segmental/arm-level CNAs, or epigenomics-associated fragmentation patterns that allow for inference of ctDNA burden. However, some cancers may not have sufficient levels of aneuploidy and chromosomal instability needed for robust estimation. Therefore, some cancers cannot be accurately monitored with this approach. Furthermore, low-pass whole genome sequencing approaches only work down to ~3% tumor ctDNA fraction and the assay must be performed in addition to the standard targeted panel sequencing, wasting precious blood plasma.
Sequencing of DNA methylation patterns may provide a general approach to quantify the cellular origin of cfDNA. However, this technology is less efficient and noisier (due to the bisulfite conversion step) and is again not directly compatible with standard targeted panel sequencing, thereby wasting precious blood plasma.
Notably, both DNA methylation and lp-WGS profiling require separate assays in addition to standard targeted gene sequencing, highlighting the need for approaches that simultaneously allow for profiling of actionable cancer mutations and quantitative estimation of ctDNA burden.
Thus, there is a need to provide an alternative method of estimating a disease burden, such as a ctDNA burden, and related kits and methods.
In one aspect, there is provided a method of estimating a circulating tumor DNA (ctDNA) burden in a subject, the method comprising: determining in a blood sample obtained from the subject, a level of cell-free DNA (cfDNA) that maps to one or more nucleosome-depleted region (NDR); and estimating the ctDNA burden based on said level of cfDNA, wherein said NDR (i) comprises the NDR of a gene which transcript is differentially expressed between healthy blood tissue and tumor tissue and/or (ii) is degraded to different extents between healthy blood tissue and blood tissue of a tumor-bearing subject.
In one embodiment, determining in a blood sample obtained from the subject, a level of cfDNA that maps to one or more NDR comprises: sequencing cfDNA fragments in the blood sample to obtain sequencing reads; and determining the number of sequencing reads that align with the one or more NDR to obtain said level of cfDNA that maps to one or more NDR.
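The read-counting step of this embodiment can be illustrated with a minimal sketch. It assumes that reads have already been aligned and reduced to (chromosome, start, end) intervals; the interval representation, the function names and the coordinates below are hypothetical and for illustration only, and are not taken from the present disclosure.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (chromosome, start, end), 0-based, half-open.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two half-open intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_reads_per_ndr(reads: List[Interval], ndrs: Dict[str, Interval]) -> Dict[str, int]:
    """Count aligned reads overlapping each named NDR (simple nested scan)."""
    counts = {name: 0 for name in ndrs}
    for read in reads:
        for name, region in ndrs.items():
            if overlaps(read, region):
                counts[name] += 1
    return counts


# Illustrative coordinates only; these are not real NDR positions.
ndrs = {"NDR_A": ("chr1", 1000, 1200), "NDR_B": ("chr2", 500, 700)}
reads = [
    ("chr1", 950, 1100),   # overlaps NDR_A
    ("chr1", 1190, 1350),  # overlaps NDR_A
    ("chr1", 1200, 1360),  # starts exactly at the NDR_A end, so no overlap
    ("chr2", 100, 260),    # lies upstream of NDR_B
]
counts = count_reads_per_ndr(reads, ndrs)
print(counts)
```

A nested scan is adequate for a small capture panel; for genome-scale data an interval index would typically be used instead.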
In one embodiment, the method further comprises contacting the blood sample with one or more probe capable of binding to the one or more NDR to capture cfDNA fragments comprising the one or more NDR prior to the sequencing step.
In one embodiment, the NDR is selected from the group consisting of: a promoter region, a first exon-intron junction and combinations thereof.
In one embodiment, the estimated ctDNA burden positively correlates with a tumor burden in the subject.
In one embodiment, said transcript that is differentially expressed between healthy blood tissue and tumor tissue comprises a transcript which FPKM (Fragments Per Kilobase of transcript per Million) value differs by at least 10 times between healthy blood tissue and tumor tissue.
In one embodiment, said NDR that is degraded to different extents between healthy blood tissue and blood tissue of a tumor-bearing subject comprises a NDR having different sequencing coverage in healthy blood tissue and in tumor tissue.
In one embodiment, said transcript that is differentially expressed in healthy blood tissue and tumor tissue is selected from the group consisting of: a transcript that is more highly expressed in healthy blood tissue than in tumor tissue, a transcript that is more highly expressed in tumor tissue than in healthy blood tissue and combinations thereof.
In one embodiment, said transcript which is differentially expressed between blood tissue and tumor tissue consists of transcript(s) that is more highly expressed in blood tissue than in tumor tissue.
In one embodiment, the one or more NDR comprises at least two NDRs, optionally six NDRs, further optionally ten NDRs.
In one embodiment, the total length of the one or more NDR is no more than 30 kb.
In one embodiment, the one or more NDR comprises one or more NDR of a gene selected from the group consisting of: SLC11A1, NLRP12, PRTN3, HMBS, LILRB3, ACSL1, GP9, MX2, RASGRP4, ATG16L2 and combinations thereof.
In one embodiment, the method is a method of determining disease progression in a subject and the method further comprises: determining in a subsequent blood sample obtained from the subject, a level of cfDNA that maps to one or more NDR; estimating the ctDNA burden based on said level of cfDNA; comparing the ctDNA burden estimated from said subsequent blood sample with the ctDNA burden estimated from said blood sample; and identifying the subject as having disease progression if the ctDNA burden estimated from said subsequent blood sample is higher than the ctDNA burden estimated from said blood sample and identifying otherwise if the ctDNA burden estimated from said subsequent blood sample is not higher than the ctDNA burden estimated from said blood sample.
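The comparison in this embodiment reduces to a strict inequality between two estimates. A minimal sketch follows, assuming both estimates are expressed on the same scale (for example, as a fraction of cfDNA); the function name and the example values are hypothetical.

```python
def has_disease_progression(initial_burden: float, subsequent_burden: float) -> bool:
    """Return True only if the later ctDNA burden estimate is strictly higher."""
    return subsequent_burden > initial_burden


# Illustrative values only; not measured data.
print(has_disease_progression(0.05, 0.12))  # later estimate is higher
print(has_disease_progression(0.05, 0.05))  # later estimate is not higher
```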
In one embodiment, the method further comprises changing the treatment regimen received by the subject if the subject is identified as having disease progression.
In one embodiment, the tumor comprises colorectal tumor.
In one embodiment, the one or more NDR comprises one or more NDR of a gene selected from the group consisting of: SHKBP1, ACSL1, BCAR1, RAB25, PRTN3, LSR and combinations thereof.
In one aspect, there is provided a kit for estimating a ctDNA burden in a subject, the kit comprising one or more probe that is capable of binding to one or more NDR, wherein said NDR (i) comprises the NDR of a gene which transcript is differentially expressed between healthy blood tissue and tumor tissue and/or (ii) is degraded to different extents between healthy blood tissue and blood tissue of a tumor-bearing subject.
In one embodiment of the kit, said one or more NDR comprises one or more NDR of a gene selected from the group consisting of: SLC11A1, NLRP12, PRTN3, HMBS, LILRB3, ACSL1, GP9, MX2, RASGRP4, ATG16L2 and combinations thereof.
In one embodiment of the kit, said tumor comprises colorectal tumor and said one or more NDR comprises one or more NDR of a gene selected from the group consisting of: SHKBP1, ACSL1, BCAR1, RAB25, PRTN3, LSR and combinations thereof.
In one embodiment of the kit, the one or more probe comprises the sequence of one or more of SEQ ID NO: 1 to SEQ ID NO: 577, or a sequence sharing at least 75% sequence identity thereto.
The term “treatment”, “treat” and “therapy”, and synonyms thereof as used herein refer to both therapeutic treatment and prophylactic or preventative measures, wherein the object is to prevent or slow down (lessen) a medical condition, which includes but is not limited to diseases (such as cancer), symptoms and disorders. A medical condition also includes a body's response to a disease or disorder, e.g. inflammation. Those in need of such treatment include those already with a medical condition as well as those prone to getting the medical condition or those in whom a medical condition is to be prevented.
The term “subject” as used herein includes patients and non-patients. The term “patient” refers to individuals suffering from or likely to suffer from a medical condition such as cancer, while “non-patients” refers to individuals not suffering from and likely to not suffer from the medical condition. “Non-patients” include healthy individuals, non-diseased individuals and/or an individual free from the medical condition. The term “subject” includes humans and animals. Animals include murine and the like. “Murine” refers to any mammal from the family Muridae, such as mouse, rat, and the like.
The term “micro” as used herein is to be interpreted broadly to include dimensions from about 1 micron to about 1000 microns.
The term “nano” as used herein is to be interpreted broadly to include dimensions less than about 1000 nm.
The term “particle” as used herein broadly refers to a discrete entity or a discrete body. The particle described herein can include an organic, an inorganic or a biological particle. The particle described herein may also be a macro-particle that is formed by an aggregate of a plurality of sub-particles or a fragment of a small object. The particle of the present disclosure may be spherical, substantially spherical, or non-spherical, such as irregularly shaped particles or ellipsoidally shaped particles. The term “size” when used to refer to the particle broadly refers to the largest dimension of the particle. For example, when the particle is substantially spherical, the term “size” can refer to the diameter of the particle; or when the particle is substantially non-spherical, the term “size” can refer to the largest length of the particle.
The terms “coupled” or “connected” as used in this description are intended to cover both directly connected or connected through one or more intermediate means, unless otherwise stated.
The term “associated with”, used herein when referring to two elements refers to a broad relationship between the two elements. The relationship includes, but is not limited to a physical, a chemical or a biological relationship. For example, when element A is associated with element B, elements A and B may be directly or indirectly attached to each other or element A may contain element B or vice versa.
The term “adjacent” used herein when referring to two elements refers to one element being in close proximity to another element and may be but is not limited to the elements contacting each other or may further include the elements being separated by one or more further elements disposed therebetween.
The term “and/or”, e.g., “X and/or Y” is understood to mean either “X and Y” or “X or Y” and should be taken to provide explicit support for both meanings or for either meaning.
Further, in the description herein, the word “substantially” whenever used is understood to include, but not restricted to, “entirely” or “completely” and the like. In addition, terms such as “comprising”, “comprise”, and the like whenever used, are intended to be non-restricting descriptive language in that they broadly include elements/components recited after such terms, in addition to other components not explicitly recited. For example, when “comprising” is used, reference to a “one” feature is also intended to be a reference to “at least one” of that feature. Terms such as “consisting”, “consist”, and the like, may in the appropriate context, be considered as a subset of terms such as “comprising”, “comprise”, and the like. Therefore, in embodiments disclosed herein using the terms such as “comprising”, “comprise”, and the like, it will be appreciated that these embodiments provide teaching for corresponding embodiments using terms such as “consisting”, “consist”, and the like. Further, terms such as “about”, “approximately” and the like whenever used, typically mean a reasonable variation, for example a variation of +/−5% of the disclosed value, or a variance of 4% of the disclosed value, or a variance of 3% of the disclosed value, or a variance of 2% of the disclosed value, or a variance of 1% of the disclosed value.
Furthermore, in the description herein, certain values may be disclosed in a range. The values showing the end points of a range are intended to illustrate a preferred range. Whenever a range has been described, it is intended that the range covers and teaches all possible sub-ranges as well as individual numerical values within that range. That is, the end points of a range should not be interpreted as inflexible limitations. For example, a description of a range of 1% to 5% is intended to have specifically disclosed sub-ranges 1% to 2%, 1% to 3%, 1% to 4%, 2% to 3% etc., as well as individually, values within that range such as 1%, 2%, 3%, 4% and 5%. It is to be appreciated that the individual numerical values within the range also include integers, fractions and decimals. Furthermore, whenever a range has been described, it is also intended that the range covers and teaches values of up to 2 additional decimal places or significant figures (where appropriate) from the shown numerical end points. For example, a description of a range of 1% to 5% is intended to have specifically disclosed the ranges 1.00% to 5.00% and also 1.0% to 5.0% and all their intermediate values (such as 1.01%, 1.02% . . . 4.98%, 4.99%, 5.00% and 1.1%, 1.2% . . . 4.8%, 4.9%, 5.0% etc.,) spanning the ranges. The intention of the above specific disclosure is applicable to any depth/breadth of a range.
Additionally, when describing some embodiments, the disclosure may have disclosed a method and/or process as a particular sequence of steps. However, unless otherwise required, it will be appreciated that the method or process should not be limited to the particular sequence of steps disclosed. Other sequences of steps may be possible. The particular order of the steps disclosed herein should not be construed as undue limitations. Unless otherwise required, a method and/or process disclosed herein should not be limited to the steps being carried out in the order written. The sequence of steps may be varied and still remain within the scope of the disclosure.
Furthermore, it will be appreciated that while the present disclosure provides embodiments having one or more of the features/characteristics discussed herein, one or more of these features/characteristics may also be disclaimed in other alternative embodiments and the present disclosure provides support for such disclaimers and these associated alternative embodiments.
Exemplary, non-limiting embodiments of a method of estimating a disease burden, such as a ctDNA burden, in a subject and related kits and methods are disclosed hereinafter.
In various embodiments, there is provided a method of estimating, predicting and/or determining one or more of: a disease burden, a cancer burden, a tumor burden, a circulating tumor DNA (ctDNA) burden, a level of ctDNA, an amount of ctDNA, a proportion of ctDNA, a fraction of ctDNA and a ctDNA content in a subject. In various embodiments, the method comprises determining in a sample obtained from the subject, a level, an amount, a proportion, a fraction and/or a content of DNA, optionally cell-free DNA (cfDNA), that aligns with, belongs to, maps to, corresponds to, is similar to and/or identical to at least one genomic region, and estimating, predicting and/or determining one or more of: the disease burden, the cancer burden, the tumor burden, the ctDNA burden, the level of ctDNA, the amount of ctDNA, the proportion of ctDNA, the fraction of ctDNA and/or the ctDNA content in the subject based on the level, the amount, the proportion, the fraction and/or the content of DNA. In some embodiments, the disease burden, the cancer burden, the tumor burden, the ctDNA burden, the level of ctDNA, the amount of ctDNA, the proportion of ctDNA, the fraction of ctDNA and/or the ctDNA content comprises the absolute disease burden, cancer burden, tumor burden, ctDNA burden, level of ctDNA, amount of ctDNA, proportion of ctDNA, fraction of ctDNA and/or ctDNA content.
The estimation, prediction and/or determination may be quantitative, semi-quantitative or qualitative. In various embodiments, the disease burden, the cancer burden, the tumor burden, the ctDNA burden, the level of ctDNA, the amount of ctDNA, the proportion of ctDNA, the fraction of ctDNA and/or the ctDNA content is associated with or correlates with the level, the amount, the proportion, the fraction and/or the content of DNA, optionally cfDNA, in the subject.
In various embodiments, the at least one genomic region comprises a gene. In various embodiments, the at least one genomic region comprises a coding region. In various embodiments, the at least one genomic region comprises a non-coding region (e.g. a region that is far away from genes, a regulatory region such as enhancer etc.). In various embodiments, the at least one genomic region comprises a nucleosome-depleted region (NDR). In various embodiments, the nucleosome-depleted region comprises a gene. In various embodiments, the nucleosome-depleted region comprises a coding region. In various embodiments, the nucleosome-depleted region comprises a non-coding region.
In various embodiments, the at least one genomic region comprises at least one coding region/gene and at least one non-coding region. In various embodiments, determining in the sample a level (or an amount, a proportion, a fraction and/or a content) of DNA that maps to (or aligns with, corresponds to, belongs to, is similar to and/or is identical to) at least one genomic region comprises determining a level of DNA that maps to each of a plurality of genomic regions, the plurality of genomic regions comprising a greater number/proportion of coding region(s)/gene(s) than non-coding region(s). In other words, in various embodiments, the non-coding regions make up a small/minority set of the plurality of regions that are being mapped to.
A NDR may be a region that has a relatively low nucleosome occupancy level. For example, a promoter region upstream of a transcriptional start site (TSS) often displays low nucleosome occupancy level for a typical gene. For example, regulatory regions tend to be nucleosome depleted. In various embodiments, the at least one NDR comprises a transcription start site, a promoter, an intron-exon junction and/or an exon-intron junction. An intron-exon junction may be a first intron-exon junction, a second intron-exon junction, a third intron-exon junction, a fourth intron-exon junction etc. An exon-intron junction may be a first exon-intron junction, a second exon-intron junction, a third exon-intron junction, a fourth exon-intron junction etc. In various embodiments, the NDR is selected from the group consisting of: a promoter region, a first exon-intron junction and combinations thereof. In various examples, cfDNA coverage/degradation pattern at a first exon-intron junction and/or a promoter region is found to possess the capability or better capability to infer gene expression and/or predict ctDNA burden.
In various embodiments, the NDR comprises the NDR of a gene which is differentially expressed in healthy blood tissue/cell and diseased tissue/cell. In various embodiments, the NDR comprises the NDR of a gene which transcript is differentially expressed in healthy blood tissue/cell and diseased tissue/cell. Because a gene usually comprises multiple alternative transcripts with different genomic positions, determining the gene expression at the transcript level (as compared to at the gene level) may allow for a more precise mapping of the NDR e.g. the promoter and junction locations. A gene which transcript is differentially expressed in healthy blood tissue/cell and diseased tissue/cell may be identified by RNA sequencing or any other suitable methods known in the art. A gene which transcript is differentially expressed in healthy blood tissue/cell and diseased tissue/cell may also be identified by analysing transcript expression data available at public databases e.g. the Genotype-Tissue Expression (GTEx) project, The Cancer Genome Atlas (TCGA) program etc. A transcript which is differentially expressed in healthy blood tissue/cell and diseased tissue/cell may have different FPKM (fragments per kilobase of transcript per million mapped fragments/reads) or RPKM (Reads Per Kilobase of transcript, per Million mapped reads), or TPM (Transcripts Per Million) values in healthy blood tissue/cell and in diseased tissue/cell (e.g. as determined by sequencing).
In various embodiments, the difference in the expression or FPKM/RPKM/TPM value of the transcript in healthy blood tissue/cell and in diseased tissue/cell is at least about 10%, at least about 20%, at least about 30%, at least about 40%, at least about 50%, at least about 60%, at least about 70%, at least about 80%, at least about 90% or at least about 100%. In various embodiments, the difference in the expression or FPKM/RPKM/TPM value of the transcript in healthy blood tissue/cell and in diseased tissue/cell is at least about 0.1 fold, at least about 0.2 fold, at least about 0.3 fold, at least about 0.4 fold, at least about 0.5 fold, at least about 0.6 fold, at least about 0.7 fold, at least about 0.8 fold, at least about 0.9 fold, at least about 1 fold, at least about 2 fold, at least about 3 fold, at least about 4 fold, at least about 5 fold, at least about 6 fold, at least about 7 fold, at least about 8 fold, at least about 9 fold, at least about 10 fold, at least about 11 fold, at least about 12 fold, at least about 13 fold, at least about 14 fold or at least about 15 fold. In various embodiments, the difference in the expression or FPKM/RPKM/TPM value of the transcript in healthy blood tissue/cell and in diseased tissue/cell is at least about 0.1 times, at least about 0.2 times, at least about 0.3 times, at least about 0.4 times, at least about 0.5 times, at least about 0.6 times, at least about 0.7 times, at least about 0.8 times, at least about 0.9 times, at least about 1 times, at least about 2 times, at least about 3 times, at least about 4 times, at least about 5 times, at least about 6 times, at least about 7 times, at least about 8 times, at least about 9 times, at least about 10 times, at least about 11 times, at least about 12 times, at least about 13 times, at least about 14 times or at least about 15 times.
In various embodiments, the FPKM/RPKM/TPM value comprises a median FPKM/RPKM/TPM value obtained from a plurality of healthy blood tissue/cell samples and/or a plurality of diseased tissue/cell samples.
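As a minimal sketch of how differentially expressed transcripts might be selected, the example below compares median FPKM values across samples and keeps transcripts whose medians differ by at least a chosen factor. The 10-times threshold is taken from the embodiments above; the pseudocount (added to avoid division by zero), the function names and the FPKM values are assumptions made for illustration only.

```python
from statistics import median
from typing import Dict, List, Tuple


def fold_difference(blood_fpkm: List[float], tumor_fpkm: List[float],
                    pseudocount: float = 1.0) -> float:
    """Ratio of the larger to the smaller median FPKM.

    The pseudocount is an assumption of this sketch; it avoids division by zero.
    """
    blood = median(blood_fpkm) + pseudocount
    tumor = median(tumor_fpkm) + pseudocount
    return max(blood, tumor) / min(blood, tumor)


def select_transcripts(expression: Dict[str, Tuple[List[float], List[float]]],
                       min_fold: float = 10.0) -> Dict[str, str]:
    """Map each selected transcript to the tissue in which it is more highly expressed."""
    selected = {}
    for transcript_id, (blood, tumor) in expression.items():
        if fold_difference(blood, tumor) >= min_fold:
            selected[transcript_id] = "blood" if median(blood) > median(tumor) else "tumor"
    return selected


# Hypothetical FPKM values (healthy blood samples, tumor samples); not real data.
expression = {
    "tx_blood_high": ([120.0, 99.0, 150.0], [2.0, 4.0, 3.0]),
    "tx_tumor_high": ([0.0, 1.0, 0.5], [40.0, 59.0, 80.0]),
    "tx_similar": ([10.0, 12.0, 11.0], [9.0, 14.0, 10.0]),
}
selected = select_transcripts(expression)
print(selected)
```

Using the median rather than the mean makes the selection less sensitive to a single outlying sample.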
In various embodiments, the NDR is degraded to different extents in healthy blood tissue/cell and in blood tissue/cell of a diseased subject. In various embodiments, the NDR has different degradation patterns/signals in healthy blood tissue/cell and in blood tissue/cell of a diseased subject. For example, when sequencing cfDNA in a healthy blood tissue/cell sample and in blood tissue/cell sample of a diseased subject, a greater or smaller number/amount (i.e. a substantially different or non-identical number/amount) of fragments/reads may map to the NDR in the healthy blood tissue/cell sample as compared to the blood tissue/cell sample of the diseased subject. For example, when sequencing cfDNA in a healthy blood tissue/cell sample and in a blood tissue/cell sample of a diseased subject, the read depth or coverage of the NDR may be higher or lower in the healthy blood tissue/cell sample as compared to the blood tissue/cell sample of a diseased subject. In various embodiments therefore, the NDR has different (or non-similar or non-identical) read depth or coverage in healthy blood tissue/cell and in blood tissue/cell of a diseased subject.
The read depth or coverage of a NDR may comprise a relative read depth or relative coverage of the NDR. A relative read depth or relative coverage of a NDR may be obtained, for example, by normalizing/dividing the raw read depth/coverage across the NDR (or optionally a mean raw read depth/coverage across the NDR for multiple samples/runs) by a normalization factor. In one example, the normalization factor comprises the read depth or coverage (or optionally a mean read depth/coverage for multiple samples/runs) of region(s) flanking the NDR e.g. the flanking upstream and/or downstream regions. In one example, the normalization factor is the mean coverage of the upstream and downstream flanks of the NDR. In one example therefore, the relative read depth or relative coverage of a NDR is the mean raw read depth/coverage across the NDR divided by the mean raw read depth/coverage of the upstream and downstream flanks.
In some embodiments, the flanking region(s) is immediately upstream or downstream of the NDR, or contiguous with the NDR. In some embodiments, the flanking region(s) is separated from the NDR by one or more nucleotides/bases. In various embodiments, the flanking region(s) is no more than about 5000 base pairs (bp), no more than about 4500 bp, no more than about 4000 bp, no more than about 3500 bp, no more than about 3000 bp, no more than about 2500 bp or no more than about 2000 bp from the NDR or an end of the NDR. In various embodiments, the flanking region(s) is at least about 50 bp, at least about 100 bp, at least about 150 bp, at least about 200 bp, at least about 250 bp, at least about 300 bp, at least about 350 bp, at least about 400 bp, at least about 450 bp, at least about 500 bp, at least about 550 bp, at least about 600 bp, at least about 650 bp, at least about 700 bp, at least about 750 bp, at least about 800 bp, at least about 850 bp, at least about 900 bp, at least about 950 bp, or at least about 1000 bp from the NDR or an end of the NDR.
In various embodiments, the size/length of flanking region(s) is at least about 50 bp, at least about 100 bp, at least about 150 bp, at least about 200 bp, at least about 250 bp, at least about 300 bp, at least about 350 bp, at least about 400 bp, at least about 450 bp, at least about 500 bp, at least about 550 bp, at least about 600 bp, at least about 650 bp, at least about 700 bp, at least about 750 bp, at least about 800 bp, at least about 850 bp, at least about 900 bp, at least about 950 bp, or at least about 1000 bp.
In one example, the NDR is about −300 bp to about 300 bp, about −200 bp to about 100 bp or about −150 bp to about 50 bp relative to a transcription start site (TSS) and the normalization factor is the mean coverage of an upstream flank that is about −2000 bp to about −1000 bp relative to the TSS and a downstream flank that is about 1000 bp to about 2000 bp relative to the TSS.
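The normalization described above can be sketched as follows, using the −150 bp to 50 bp NDR window and the −2000 bp to −1000 bp and 1000 bp to 2000 bp flanks from the example. The sketch assumes a per-base depth track for a plus-strand gene, half-open windows, and that the normalization factor is the average of the two flank means; these choices, together with the function names and depth values, are illustrative assumptions.

```python
from typing import Sequence, Tuple

Window = Tuple[int, int]  # half-open (start, end) offsets relative to the TSS


def mean_depth(depth: Sequence[float], tss_index: int, window: Window) -> float:
    """Mean per-base depth over a window given as offsets from the TSS."""
    lo, hi = tss_index + window[0], tss_index + window[1]
    if lo < 0 or hi > len(depth) or lo >= hi:
        raise ValueError("window falls outside the depth track")
    return sum(depth[lo:hi]) / (hi - lo)


def relative_coverage(depth: Sequence[float], tss_index: int,
                      ndr: Window = (-150, 50),
                      upstream: Window = (-2000, -1000),
                      downstream: Window = (1000, 2000)) -> float:
    """Mean NDR depth divided by the mean depth of the two flanking regions."""
    ndr_mean = mean_depth(depth, tss_index, ndr)
    flank_mean = (mean_depth(depth, tss_index, upstream)
                  + mean_depth(depth, tss_index, downstream)) / 2
    return ndr_mean / flank_mean


# Synthetic depth track: uniform depth of 100 with a dip to 40 across the NDR.
depth = [100.0] * 4001
tss_index = 2000
for i in range(tss_index - 150, tss_index + 50):
    depth[i] = 40.0
rel = relative_coverage(depth, tss_index)
print(rel)
```

A relative coverage below 1 indicates fewer cfDNA fragments over the NDR than over its flanks, which is the depletion pattern the disclosure associates with expressed genes.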
In various embodiments, a NDR that is degraded to different extents in healthy blood tissue/cell and in blood tissue/cell of a diseased subject may be identified by comparing the relative depth/coverage of the NDR in healthy blood tissue/cell and in the blood tissue/cell of a diseased subject. For example, if the relative depth/coverage of the NDR in healthy blood tissue/cell and in the blood tissue/cell of a diseased subject are different, the NDR is considered to be a NDR that is degraded to different extents in healthy blood tissue/cell and in blood tissue/cell of a diseased subject. In various embodiments, determining the relative depth/coverage of a NDR in healthy blood tissue/cell and/or in blood tissue/cell of a diseased subject comprises determining the coverage of each position in an about 8 k-bp window, about 6 k-bp window, about 4 k-bp window, about 2 k-bp window or about 1 k-bp window spanning from about −4000 to +4000 bp, from about −3000 to +3000 bp, from about −2000 to +2000 bp, from about −1000 to +1000 bp or from about −500 to +500 bp with respect to the NDR (e.g. end(s) of the NDR); and optionally normalizing the coverage by the mean coverage of the upstream region (e.g. −8000 to −4000 bp, −4000 to −2000 bp, −3000 to −1000 bp, −2000 to −1000 bp or −1000 to −500 bp with respect to the NDR (e.g. end(s) of the NDR)) and/or downstream region (e.g. +4000 bp to +8000 bp, +2000 to +4000 bp, +1000 to +3000 bp, +1000 to +2000 bp or +500 to +1000 bp with respect to the NDR (e.g. end(s) of the NDR)) to obtain a relative depth/coverage for the NDR. In some examples, the coverage of each position in a region located downstream of a NDR (e.g. a promoter) is determined. In some examples, the coverage of each position in a region located from about −350 bp to about −50 bp or from about −300 to about −100 bp with respect to a NDR (e.g. an end of a first exon) is determined.
In various embodiments, the difference in read depth or coverage (or relative read depth or coverage) in healthy blood tissue/cell and in blood tissue/cell of a diseased subject is measured by computing a coverage score (or relative coverage score). In various embodiments, the coverage score (or relative coverage score) is computed by the following formula:
coverage score = [mean(diseased) − mean(healthy)] / s.d.(diseased)
where mean(diseased) and mean(healthy) are the mean of average coverages (or relative coverages) at NDRs across diseased blood tissue/cell (e.g. plasma samples of diseased subjects) and healthy blood tissue/cell (e.g. healthy plasma samples) respectively, and s.d.(diseased) is the standard deviation of average coverages (or relative coverages) at NDRs across diseased blood tissue/cell.
In various embodiments, the coverage values negatively correlate with expression level. In some examples therefore, blood genes/transcripts (e.g. genes/transcripts that show a higher FPKM value in normal blood than in tumor) have a higher coverage in diseased samples than in healthy samples. Thus, the blood genes/transcripts have a positive value of relative coverage score, as mean(diseased)>mean(healthy). In some examples, tumor genes/transcripts (e.g. genes/transcripts that show a higher FPKM value in tumor than in normal blood) have a lower coverage in diseased samples than in healthy samples. Thus, the tumor genes/transcripts have a negative value of relative coverage score, as mean(diseased)<mean(healthy).
In various embodiments, the NDR has a coverage score or relative coverage score of less than about 0 and/or more than about 0. In various embodiments, the NDR has a coverage score or relative coverage score of less than about −0.1, less than about −0.2, less than about −0.3, less than about −0.4, less than about −0.5, less than about −0.6, less than about −0.7, less than about −0.8, less than about −0.9 or less than about −1.0. In various embodiments, the NDR has a coverage score or relative coverage score of more than about 0.1, more than about 0.2, more than about 0.3, more than about 0.4, more than about 0.5, more than about 0.6, more than about 0.7, more than about 0.8, more than about 0.9 or more than about 1.0.
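The sign convention of the coverage score can be illustrated with a short sketch. It assumes that the score takes the form [mean(diseased) − mean(healthy)] / s.d.(diseased), which is consistent with the term definitions and the signs described above, and it assumes the sample standard deviation, since the text does not state which estimator is used. The function name and the coverage values are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def coverage_score(diseased: Sequence[float], healthy: Sequence[float]) -> float:
    """Assumed form: (mean(diseased) - mean(healthy)) / s.d.(diseased)."""
    sd = stdev(diseased)  # sample standard deviation; an assumption of this sketch
    if sd == 0:
        raise ValueError("diseased coverages have zero standard deviation")
    return (mean(diseased) - mean(healthy)) / sd


# Hypothetical relative coverages at one NDR across samples; not real data.
blood_gene_score = coverage_score(diseased=[0.8, 0.9, 1.0], healthy=[0.5, 0.6, 0.55])
tumor_gene_score = coverage_score(diseased=[0.4, 0.5, 0.6], healthy=[0.9, 1.0, 0.95])
print(blood_gene_score, tumor_gene_score)
```

Under these assumptions a blood gene/transcript, whose NDR coverage rises in diseased samples, receives a positive score, and a tumor gene/transcript receives a negative score.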
As used herein, “blood”, “blood tissue” or “blood sample” refers to whole blood or fractions thereof, such as a plasma fraction or a serum fraction. As used herein “healthy blood”, “healthy blood tissue” or “healthy blood sample” refers to the whole blood or fractions thereof of a healthy subject, or a subject who does not suffer from the disease. Conversely, “diseased blood”, “diseased blood tissue” or “diseased blood sample” as used herein refers to the whole blood or fractions thereof of a diseased subject, or a subject who suffers from the disease. In various embodiments, “diseased blood”, “diseased blood tissue” or “diseased blood sample” does not indicate that a disease necessarily resides in the blood per se. For example, “diseased blood”, “diseased blood tissue” or “diseased blood sample” may refer to the blood, tissue or sample of a subject suffering from colorectal cancer and having no blood diseases, and “healthy blood”, “healthy blood tissue” or “healthy blood sample” may refer to the blood, tissue or sample of a subject who does not suffer from colorectal cancer.
In various embodiments, the sample obtained from the subject comprises a liquid sample. In various embodiments, the sample comprises a biological fluid sample. In various embodiments, the liquid/biological fluid sample comprises one or more of blood, serum, plasma, sputum, lavage fluid, cerebrospinal fluid, interstitial fluid, urine, feces, milk, semen, sweat, tears, saliva, and the like. In various embodiments, the sample comprises a blood sample (e.g. whole blood sample or processed fractions thereof). In various embodiments, the sample comprises a plasma sample. In various embodiments, the sample comprises cfDNA. In various embodiments, the sample comprises cfDNA, for example, cfDNA extracted/isolated/purified from a blood sample obtained from the subject.
In various embodiments, the disease comprises a proliferative disease and the diseased tissue/cell comprises a proliferative tissue/cell. In various embodiments, the disease comprises a malignant disease and the diseased tissue/cell comprises a malignant tissue/cell. In various embodiments, the malignant disease comprises cancer and the diseased tissue/cell comprises a cancer tissue/cell. In various embodiments, the cancer comprises solid tumor cancers.
In various embodiments therefore, there is provided a method of estimating a ctDNA burden in a subject, the method comprising: determining in a blood sample obtained from the subject, a level of cfDNA that maps to one or more nucleosome-depleted region (NDR); and estimating the ctDNA burden based on said level of cfDNA, wherein said NDR (i) comprises the NDR of a gene which transcript is differentially expressed between healthy blood tissue and tumor tissue and/or (ii) is degraded to different extents between healthy blood tissue and blood tissue of a tumor-bearing subject. Advantageously, a level of cfDNA that maps to selected NDR(s) is identified to be a good estimator of or proxy for tumor burden or ctDNA burden.
In various embodiments, the estimated ctDNA burden associates/correlates, optionally positively associates/correlates with a tumor burden in the subject. In various embodiments, the higher the estimated ctDNA burden in the subject, the higher the tumor burden in the subject. In various embodiments, the higher the estimated ctDNA burden in the subject, the higher the estimated amount of cancer/tumor cells in the subject. In various embodiments, the higher the estimated ctDNA burden in the subject, the higher the estimated mass/size/volume of tumor in the subject. In various embodiments, the association/correlation, optionally positive association/correlation, may be linear (i.e. the ratio of change is constant) or non-linear (i.e. the ratio of change is not constant).
In various embodiments, the estimated ctDNA burden is associated/correlated with the level of cfDNA that maps to one or more NDRs. The association/correlation may be positive and/or negative, linear and/or non-linear and monotonic and/or non-monotonic. For example, the estimated ctDNA burden may be positively associated/correlated with the level of cfDNA that maps to a first NDR and negatively associated/correlated with the level of cfDNA that maps to a second NDR. For example, the estimated ctDNA burden may be linearly associated/correlated (e.g. positive or negative) with the level of cfDNA that maps to a first NDR and non-linearly associated/correlated with the level of cfDNA that maps to a second NDR. For example, the estimated ctDNA burden may be monotonically associated/correlated with the level of cfDNA that maps to a first NDR and non-monotonically associated/correlated with the level of cfDNA that maps to a second NDR.
In one example, the signs of the coefficients for the one or more NDRs in a trained model correspond to the sign of the differential expression of the associated transcripts in tumor tissue relative to healthy blood tissue. In various embodiments, an NDR associated with a cancer-specific gene/transcript or a tumor gene/transcript (e.g. a gene/transcript that shows a higher FPKM value in tumor than in normal blood) has a negative coefficient/correlation with the estimated ctDNA burden. In various embodiments, an NDR associated with a blood gene/transcript (e.g. a gene/transcript that shows a higher FPKM value in normal blood than in tumor) has a positive coefficient/correlation with the estimated ctDNA burden. In various embodiments, the estimated ctDNA burden is negatively associated/correlated with a level of cfDNA that maps to one or more NDR of a gene which transcript is more highly expressed in tumor tissue than in healthy blood tissue and/or the estimated ctDNA burden is positively associated/correlated with a level of cfDNA that maps to one or more NDR of a gene which transcript is more highly expressed in healthy blood tissue than in tumor tissue. In some embodiments, the estimated ctDNA burden is linearly correlated with the level of cfDNA that maps to one or more NDRs.
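By way of non-limiting illustration only, the relationship between coefficient signs and the estimated ctDNA burden may be sketched as a simple linear combination. The NDR names, coefficient values, intercept and function name below are hypothetical and are not taken from any trained model described herein; the sketch only shows how a positive coefficient for an NDR of a blood transcript and a negative coefficient for an NDR of a tumor transcript would combine.

```python
def estimate_ctdna_burden(relative_coverage, coefficients, intercept=0.0):
    """Return intercept + sum(coefficient * relative coverage) over the model's NDRs."""
    missing = sorted(set(coefficients) - set(relative_coverage))
    if missing:
        raise KeyError(f"no coverage value for NDR(s): {missing}")
    return intercept + sum(
        coefficients[ndr] * relative_coverage[ndr] for ndr in coefficients
    )


# Hypothetical coefficients: positive for an NDR of a blood transcript,
# negative for an NDR of a tumor transcript, following the signs described above.
coefficients = {"blood_gene_ndr": 0.5, "tumor_gene_ndr": -0.25}

# Hypothetical relative coverage values for one plasma sample.
sample = {"blood_gene_ndr": 0.6, "tumor_gene_ndr": 0.4}

estimate = estimate_ctdna_burden(sample, coefficients, intercept=0.05)
print(f"estimated ctDNA burden (arbitrary units): {estimate:.3f}")
```

In this sketch, a higher relative coverage at the blood-transcript NDR raises the estimate and a higher relative coverage at the tumor-transcript NDR lowers it. A non-linear model may equally be substituted.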
In various embodiments, the determining step comprises sequencing the DNA or cfDNA present in the blood sample obtained from the subject. Examples of sequencing techniques include next-generation sequencing, amplicon-based sequencing, paired-end sequencing, Sanger sequencing etc. In some embodiments, sequencing the DNA or cfDNA present in the blood sample comprises subjecting the DNA or cfDNA present in the blood sample to deep sequencing. In one embodiment, sequencing the DNA or cfDNA present in the blood sample comprises subjecting the DNA or cfDNA present in the blood sample to next-generation sequencing. In some examples, deep sequencing is performed such that the depth/coverage at the one or more NDR/at least one NDR is at least about 10×, at least about 25×, at least about 50×, at least about 100×, at least about 200×, at least about 300×, at least about 400×, at least about 500×, at least about 600×, at least about 700×, at least about 800×, at least about 900×, at least about 1000×, at least about 2000×, at least about 3000×, at least about 4000×, at least about 5000× or at least about 6000×. In various embodiments, the sequencing does not comprise ultra-deep sequencing. In various embodiments, the depth/coverage at the one or more NDR/at least one NDR is or is kept to less than about 10,000×, less than about 9000×, less than about 8000×, less than about 7000×, less than about 6000×, less than about 5000×, less than about 4000×, less than about 3000×, less than about 2000× or less than about 1000×. In various embodiments, the depth/coverage at the one or more NDR/at least one NDR is or is kept to no more than about 10,000×, no more than about 9000×, no more than about 8000×, no more than about 7000×, no more than about 6000×, no more than about 5000×, no more than about 4000×, no more than about 3000×, no more than about 2000× or no more than about 1000×.
In various embodiments, determining in a blood sample obtained from the subject, a level of cfDNA that maps to one or more NDR comprises sequencing cfDNA/cfDNA fragments in the blood sample to obtain sequencing reads; and determining the number of sequencing reads that align with the one or more NDR to obtain said level of cfDNA that maps to one or more NDR. In various embodiments, determining in a blood sample obtained from the subject, a level of cfDNA that maps to one or more NDR comprises sequencing any cfDNA/cfDNA fragments present in the blood sample and determining the depth/read depth/coverage/sequencing coverage at the one or more NDR. The depth/read depth/coverage may be a relative depth/read depth/coverage/sequencing coverage. For example, the depth/read depth/coverage/sequencing coverage may be normalized/divided by a normalization factor, for example, a normalization factor as described herein, to obtain the relative depth/read depth/coverage/sequencing coverage. In one example, the relative depth/read depth/coverage/sequencing coverage is obtained by dividing/normalizing the depth/read depth/coverage/sequencing coverage (or mean depth/read depth/coverage/sequencing coverage) across the one or more NDR by the depth/read depth/coverage/sequencing coverage (or mean depth/read depth/coverage/sequencing coverage) of an upstream flank and/or a downstream flank, for example, an upstream flanking region and/or a downstream flanking region as described herein. In some embodiments therefore, the method further comprises determining the number of sequencing reads that align with one or more regions flanking the one or more NDR. In some embodiments, the method further comprises determining the depth/read depth/coverage/sequencing coverage at the one or more regions flanking the one or more NDR.
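By way of non-limiting illustration only, the flank-based normalization described above may be sketched as follows. The function name and the per-base depth values are hypothetical. The sketch assumes that per-base read depths for the NDR and for both flanking regions have already been extracted from the aligned reads, and that the mean depth across both flanks is used as the normalization factor.

```python
from statistics import mean


def relative_coverage(ndr_depths, upstream_depths, downstream_depths):
    """Mean depth across the NDR divided by the mean depth across both flanks."""
    flank_depths = list(upstream_depths) + list(downstream_depths)
    if not flank_depths:
        raise ValueError("at least one flanking position is required")
    flank_mean = mean(flank_depths)
    if flank_mean == 0:
        raise ValueError("flanking regions have zero coverage")
    return mean(ndr_depths) / flank_mean


# Hypothetical per-base read depths, for illustration only.
ndr = [40, 38, 42, 40]
upstream = [100, 98, 102]
downstream = [99, 101, 100]

score = relative_coverage(ndr, upstream, downstream)
print(f"relative coverage at the NDR: {score:.2f}")
```

Dividing by the flank depth makes the value comparable between samples sequenced to different depths, which is the purpose of the normalization factor described above.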
The sequencing may be targeted or untargeted. Where the sequencing comprises targeted sequencing, probe(s) may be used to capture and isolate specific genomic regions for sequencing. In some embodiments therefore, the method further comprises contacting the blood sample with one or more probe capable of binding to the one or more NDR to capture cfDNA/cfDNA fragments comprising the one or more NDR prior to the sequencing step.
In various embodiments, determining in a blood sample obtained from the subject, a level of cfDNA that maps to one or more NDR comprises performing quantitative polymerase chain reaction (qPCR) or real-time polymerase chain reaction (real-time PCR) to determine the amount/proportion of cfDNA that maps to one or more NDR. In various embodiments, the performing step comprises contacting the sample with a primer that is capable of hybridizing/binding (e.g. under stringent conditions) to or a primer that is specific to the one or more NDR.
In various embodiments, the method further comprises amplifying the cfDNA in the blood sample. The amplification step may be carried out before the step of determining a level of cfDNA. The amplification step may also be carried out before the step of sequencing cfDNA/cfDNA fragments in the blood sample and/or before the step of contacting the blood sample with the one or more probe. Amplification reactions known in the art may be employed. The amplification reactions may include but are not limited to polymerase chain reaction (PCR), ligase chain reaction (LCR), loop mediated isothermal amplification (LAMP), nucleic acid sequence based amplification (NASBA), self-sustained sequence replication (3SR), rolling circle amplification (RCA) or any other process whereby one or more copies of a particular polynucleotide sequence or nucleic acid sequence may be generated from a polynucleotide template sequence or nucleic acid template sequence.
In various embodiments, the method further comprises processing the cfDNA and/or its associated data. In various embodiments, the cfDNA are trimmed at one or both ends to retain only a central region and/or data associated with a central region of the cfDNA. Advantageously, trimming the cfDNA and/or its associated data from one or both ends to retain only a central region and/or data associated with a central region of the cfDNA may amplify a degradation signal and/or increase a coverage signal. In various embodiments, the trimmed cfDNA/central region is no more than about 70 bp, no more than about 60 bp or no more than about 50 bp in length. In various embodiments, the trimmed cfDNA/central region is about 70 bp, about 60 bp or about 50 bp in length. In one embodiment, the central region is about 61 bp. The method may also work with an untrimmed cfDNA (e.g. a cfDNA of about 151 bp), although the signal produced may be weaker.
In various embodiments, the cfDNA and/or its associated data are trimmed in silico, e.g. by use of the software BamUtil. In various embodiments, the cfDNA and/or its associated data are trimmed after sequencing.
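By way of non-limiting illustration only, in silico trimming to a central region may be sketched at the level of genomic coordinates. The function name and the coordinates are hypothetical, and the sketch does not reproduce the behaviour of BamUtil or of any other tool; it assumes half-open coordinates and an equal trim from both ends, with any odd base remaining on the right.

```python
def central_window(start, end, width=61):
    """Return the half-open [start, end) window of `width` bp centred on a read."""
    length = end - start
    if length <= 0:
        raise ValueError("read must have positive length")
    if length <= width:
        # Reads already at or below the target width are kept unchanged.
        return start, end
    left_trim = (length - width) // 2
    new_start = start + left_trim
    return new_start, new_start + width


# A hypothetical 151 bp read trimmed to its central 61 bp.
window = central_window(1000, 1151)
print(window)
```

For a 151 bp read and a 61 bp central region, 45 bp are removed from each end in this sketch.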
In various embodiments, said NDR that is degraded to different extents between healthy blood tissue and blood tissue of a tumor-bearing subject comprises a NDR having different depth/read depth/coverage/sequencing coverage in healthy blood tissue and in tumor tissue.
In various embodiments, said transcript that is differentially expressed between healthy blood tissue and tumor tissue comprises a transcript which FPKM value differs by at least about 2 times, at least about 3 times, at least about 4 times, at least about 5 times, at least about 6 times, at least about 7 times, at least about 8 times, at least about 9 times or at least about 10 times between healthy blood tissue and tumor tissue (e.g. as determined by sequencing).
In various embodiments, said transcript that is differentially expressed between healthy blood tissue and tumor tissue comprises a transcript which FPKM value in healthy blood tissue is less than about 30, less than about 20, less than about 10, less than about 5, less than about 3, less than about 1, less than about 0.5, less than about 0.1, less than about 0.05 or less than about 0.01. In one embodiment, the FPKM value of the transcript in healthy blood tissue is less than about 1. In various embodiments, the FPKM value of the transcript in healthy blood tissue is more than about 0.01, more than about 0.05, more than about 0.1, more than about 0.5, more than about 1, more than about 3, more than about 5, more than about 10, more than about 20 or more than about 30. In one embodiment, the FPKM value of the transcript in healthy blood tissue is more than about 10. In various embodiments, the FPKM value of the transcript in healthy blood tissue is between about 0.01 and about 0.1, between about 0.1 and about 1, between about 1 and about 5 or between about 5 and about 30.
In various embodiments, said transcript that is differentially expressed between healthy blood tissue and tumor tissue comprises a transcript which FPKM value in tumor tissue is less than about 30, less than about 20, less than about 10, less than about 5, less than about 3, less than about 1, less than about 0.5, less than about 0.1, less than about 0.05 or less than about 0.01. In one embodiment, the FPKM value of the transcript in tumor tissue is less than about 1. In various embodiments, the FPKM value of the transcript in tumor tissue is more than about 0.01, more than about 0.05, more than about 0.1, more than about 0.5, more than about 1, more than about 3, more than about 5, more than about 10, more than about 20 or more than about 30. In one embodiment, the FPKM value of the transcript in tumor tissue is more than about 10. In various embodiments, the FPKM value of the transcript in tumor tissue is between about 0.01 and about 0.1, between about 0.1 and about 1, between about 1 and about 5 or between about 5 and about 30.
Some transcripts may be more highly expressed in healthy blood tissue than tumor tissue. In various embodiments, a blood transcript comprises a transcript that is more highly expressed in healthy blood tissue than tumor tissue. Some transcripts may be more highly expressed in tumor tissue than blood tissue. In various embodiments, a tumor transcript comprises a transcript that is more highly expressed in tumor tissue than blood tissue. The one or more NDR may comprise at least about 10%, at least about 20%, at least about 30%, at least about 40%, at least about 50%, at least about 60%, at least about 70%, at least about 80%, at least about 90% or about 100% NDRs which transcripts are more highly expressed in healthy blood tissue than tumor tissue. The one or more NDR may comprise at least about 10%, at least about 20%, at least about 30%, at least about 40%, at least about 50%, at least about 60%, at least about 70%, at least about 80%, at least about 90% or about 100% NDRs which transcripts are more highly expressed in tumor tissue than in blood tissue. The one or more NDR may comprise at least about one, at least about two or at least about three NDRs which transcripts are more highly expressed in healthy blood tissue than tumor tissue and/or at least about one, at least about two or at least about three NDRs which transcripts are more highly expressed in tumor tissue than in blood tissue.
In various embodiments, said transcript which is differentially expressed in healthy blood tissue and tumor tissue is selected from the group consisting of: a transcript that is more highly expressed in healthy blood tissue than tumor tissue, a transcript that is more highly expressed in tumor tissue than healthy blood tissue and combinations thereof. In one embodiment, said transcript which is differentially expressed between blood tissue and tumor tissue consists of transcript(s) that is more highly expressed in tumor tissue than blood tissue. In one embodiment, said transcript which is differentially expressed between blood tissue and tumor tissue consists of transcript(s) that is more highly expressed in blood tissue than tumor tissue. In one embodiment, said transcript does not comprise a transcript which is more highly expressed in tumor tissue than blood tissue. Without being bound by theory, it is believed that tumor-derived DNA component in cancer plasma weakens the blood-specific DNA degradation pattern, and thus the decay of blood-specific signal (alone i.e. without determining the signal of any tumor-associated genes) may be used to robustly estimate a ctDNA content, regardless of cancer types.
In various embodiments therefore, the method is suitable for estimating a disease burden for a specific cancer type, a specific group of cancers, or for all cancers in general (i.e. pan-cancer). In various embodiments, the method comprises a method of estimating a ctDNA burden or tumor burden associated with one or more of the following cancers: bladder cancer, bladder urothelial carcinoma, breast cancer, breast invasive carcinoma, cervical cancer, cervical squamous cell carcinoma, endocervical adenocarcinoma, colorectal cancer, esophageal cancer, esophageal carcinoma, brain cancer, glioblastoma multiforme, head and neck cancer, head and neck squamous cell carcinoma, kidney cancer, renal cell cancer, kidney renal clear cell carcinoma, kidney renal papillary cell carcinoma, brain lower grade glioma, liver cancer, liver hepatocellular carcinoma, lung cancer, lung adenocarcinoma, lung squamous cell carcinoma, ovarian cancer, ovarian serous cystadenocarcinoma, pancreatic cancer, pancreatic adenocarcinoma, prostate cancer, prostate adenocarcinoma, skin cancer, skin cutaneous melanoma, gastric cancer, stomach cancer, stomach adenocarcinoma, thyroid cancer, thyroid carcinoma, endometrial cancer, uterine cancer, uterine corpus endometrial carcinoma, reproductive cancers, gastrointestinal cancers, respiratory cancers, or subtypes thereof. Thus, in various embodiments, the subject has or suffers from one or more of these cancers. In various embodiments, the tumor-bearing subject bears one or more of these tumors. In various embodiments, the subject or tumor-bearing subject does not have or does not suffer from blood cancer/hematologic cancer/hematologic malignancy.
In one embodiment, the method comprises a method of estimating a tumor burden associated with colorectal cancer. In one embodiment, the subject has or suffers from colorectal cancer. In one embodiment, the tumor-bearing subject bears a colorectal tumor. In one embodiment, the method comprises a method of estimating a ctDNA burden or tumor burden associated with breast cancer. In one embodiment, the subject has or suffers from breast cancer. In one embodiment, the tumor-bearing subject bears a breast tumor.
In one embodiment, where the method comprises a method of estimating a ctDNA burden or tumor burden associated with a specific cancer type or a specific group of cancers, the NDR comprises at least one NDR of a gene which transcript shows a higher FPKM value in tumor belonging to the specific cancer type or the specific group of cancers than in healthy/normal blood. In some examples, the transcript has a FPKM_tumor > about 5 or > about 10 and a FPKM_blood < about 1. In one embodiment, where the method comprises a method of estimating a ctDNA burden or tumor burden associated with any cancer in general (e.g. pan-cancer), the NDR comprises at least one NDR of a gene which transcript shows a higher FPKM value in normal blood than in tumor. In some examples, the transcript has a FPKM_blood > about 5 or > about 10 and a FPKM_tumor < about 1. In various embodiments, where the method comprises a method of estimating a ctDNA burden or tumor burden associated with any cancer in general (e.g. pan-cancer), the NDR consists of NDR(s) of gene(s) which transcript(s) shows a higher FPKM value in normal blood than in tumor.
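By way of non-limiting illustration only, the FPKM thresholds given in the examples above may be applied as a simple filter that labels a transcript as a tumor transcript or a blood transcript. The function name, the default thresholds (about 5 for the high threshold and about 1 for the low threshold) and the example FPKM values are hypothetical choices made for this sketch.

```python
def classify_transcript(fpkm_tumor, fpkm_blood, high=5.0, low=1.0):
    """Label a transcript from its FPKM values in tumor tissue and in healthy blood."""
    if fpkm_tumor > high and fpkm_blood < low:
        return "tumor"
    if fpkm_blood > high and fpkm_tumor < low:
        return "blood"
    return "neither"


# Hypothetical (FPKM in tumor, FPKM in healthy blood) pairs, for illustration only.
examples = {
    "transcript_a": (25.0, 0.3),
    "transcript_b": (0.2, 40.0),
    "transcript_c": (8.0, 6.0),
}

labels = {name: classify_transcript(*values) for name, values in examples.items()}
print(labels)
```

Setting `high=10.0` applies the stricter of the two thresholds mentioned above. For a pan-cancer estimate, only the transcripts labelled "blood" would be retained under this sketch.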
In various embodiments, the one or more NDR comprises at least about two, at least about three, at least about four, at least about five, at least about six, at least about seven, at least about eight, at least about nine, at least about ten, at least about 11, at least about 12, at least about 13, at least about 14, at least about 15, at least about 16, at least about 17, at least about 18, at least about 19 or at least about 20 NDRs. In various embodiments, the one or more NDR comprises the NDR of at least about two, at least about three, at least about four, at least about five, at least about six, at least about seven, at least about eight, at least about nine, at least about ten, at least about 11, at least about 12, at least about 13, at least about 14, at least about 15, at least about 16, at least about 17, at least about 18, at least about 19 or at least about 20 genes or distinct genes.
In some embodiments, the one or more NDR comprises at least about two NDRs, optionally about six NDRs, further optionally about ten NDRs. In some embodiments, the one or more NDR comprises the NDR of at least about two genes (or distinct genes), optionally about six genes (or distinct genes), further optionally about ten genes (or distinct genes). In some embodiments, the one or more NDR comprises at least about four NDRs or the NDRs of at least about four genes or distinct genes. In some embodiments, the one or more NDR comprises no more than about nine NDRs or NDRs of no more than about nine genes or distinct genes. In some embodiments, the one or more NDR comprises about four to about nine NDRs or NDRs of about four to about nine genes or distinct genes. In some embodiments, the one or more NDR comprises about six NDRs or NDRs of about six genes or distinct genes. In some embodiments, the one or more NDR comprises no more than about 13 NDRs or NDRs of no more than about 13 genes or distinct genes. In some embodiments, the one or more NDR comprises about nine, about 10, about 11, about 12 or about 13 NDRs or NDRs of about nine, about 10, about 11, about 12 or about 13 genes or distinct genes. The suitable number of NDRs, genes or features may be further varied, and is within the purview of a person skilled in the art. The number or the reasonable range of numbers of NDRs, genes or features may be determined, for example, by checking an error evolution with the number of top predictive genes or features (e.g. genes or features that are selected most frequently as being predictive by a machine learning model in multiple iterations).
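By way of non-limiting illustration only, the selection of a suitable number of features may be sketched in two steps: ranking features by how often they are selected across iterations, and then picking the smallest number of top-ranked features that attains the lowest error. The function names, feature names, selections and error values below are hypothetical; in practice the errors would come from re-fitting and evaluating a model with the top k features, which this sketch does not perform.

```python
from collections import Counter


def rank_by_selection_frequency(selections):
    """Rank features by the number of iterations in which they were selected."""
    counts = Counter(f for selected in selections for f in set(selected))
    # Most frequently selected first; ties broken alphabetically for stability.
    return sorted(counts, key=lambda f: (-counts[f], f))


def best_feature_count(error_by_count):
    """Return the smallest number of features that attains the minimum error."""
    best_error = min(error_by_count.values())
    return min(k for k, err in error_by_count.items() if err == best_error)


# Hypothetical features selected in three training iterations.
selections = [
    ["ndr_a", "ndr_b", "ndr_c"],
    ["ndr_a", "ndr_b"],
    ["ndr_a", "ndr_d"],
]
ranked = rank_by_selection_frequency(selections)

# Hypothetical validation error obtained with the top k ranked features.
error_by_count = {1: 0.30, 2: 0.20, 3: 0.20, 4: 0.25}
k = best_feature_count(error_by_count)
print(ranked[:k])
```

Preferring the smallest k among equally good options keeps the panel small, which is one possible design choice; a range of acceptable k values may be read off the same error curve.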
In various embodiments, the NDRs/genes comprise one or more NDRs/genes listed in one or more of Table 1, Table 2, Table S3, Table S10, Table S14, Table S15, Table S16, Table S17, Table S18, Table S19, Table S20 and Table S21.
In various embodiments, the NDRs/genes comprise one or more, or at least about one, at least about two, at least about three, at least about four, at least about five, at least about six, at least about seven, at least about eight, at least about nine, or at least about ten of the following genes/associated NDRs (e.g. a transcription start site, a promoter, an intron-exon junction and/or an exon-intron junction): ABHD5, ABTB1, ACAP1, ACO1, ACRBP, ACSL1, ADAM8, ADIRF, AGR2, AGR3, AHSP, AK2, AKNA, ALAS2, ALDH18A1, ALOX5, ANKS4B, ANPEP, AOAH, APOBEC3A, ARAP1, ARHGAP25, ARHGAP26, ARHGAP30, ARHGAP9, ARHGEF16, ARHGEF35, ARID5A, ARRB2, ARSE, ATG16L2, ATP2A2, ATP2C2, ATP5G1, ATP5G3, ATP6V1B2, AXIN2, AZGP1, AZU1, B3GNT3, BATF2, BCAR1, BCL2L15, BCL2L2, BCL6, BDH1, BDH2, BEST1, BGN, BIN2, BIRC5, BMP4, BMX, BOK, BPI, BSPRY, BTK, BTNL8, C10orf54, C11orf21, C16orf54, C19orf33, C19orf35, C1orf162, C1orf210, C1orf228, C1QTNF5, C3, C5AR2, C6orf203, C6orf25, C8orf59, CA1, CA4, CALD1, CAMP, CAPN5, CARS2, CCDC88B, CCL20, CCM2, CCND3, CCR7, CD177, CD244, CD276, CD300E, CD300LB, CD300LF, CD37, CD44, CD53, CD55, CDC42SE1, CDCP1, CDH1, CDH17, CDHR5, CDK4, CDK5RAP2, CDX1, CEACAM1, CEACAM3, CEACAM4, CEACAM5, CELF2, CENPF, CFD, CFP, CFTR, CHCHD6, CHID1, CKB, CKMT1B, CLC, CLDN7, CLEC12A, CLEC4D, CLEC4E, CMTM2, CNN2, CORO1A, CORO7, COTL1, COX6C, CR1, CRB3, CSF3R, CTGF, CTNND1, CXCR1, CXCR2, CYBA, CYTH4, DDC, DDR1, DDX10, DEF8, DEFA1, DEFA1B, DEFA3, DEFA4, DENND1C, DENND3, DHRS13, DHX34, DMTN, DNAH17, DOCK2, DOK3, DPEP2, DYSF, ECE1, ECT2, EEF1E1, EFNA3, EGLN2, EI24, ELANE, ELF3, EMP1, ENTPD2, ENTPD6, EPB42, EPCAM, EPHA2, EPS8L3, ERBB2, ERBB3, EVI2B, F3, FAM101A, FAM109A, FAM212B, FAM213B, FAM49B, FAM60A, FAM65B, FAM83E, FAM84A, FBL, FBXL5, FCAR, FCGR2A, FCGR3B, FCN1, FERMT1, FERMT3, FES, FFAR2, FGD3, FGFR4, FGFRL1, FGR, FKBP8, FLOT2, FMNL1, FN1, FOLR3, FOXA2, FPR1, FPR2, FUT2, FUT6, FUT7, GATA1, GBP3, GCA, GGT6, GJB2, GLRX3, GLT1D1, GMFG, GMNN, GNG2, GNLY, GOLT1A, GP9, GPC4, GPR35, GPRC5A, GPSM3, GPX2, GRAMD1A, GRAP2, GRTP1, GZMH, H2AFY2, HBA1, HBA2, HBB, HBD, HBG1, HBG2, HBM, HBQ1, HCK, HCLS1, HID1, HK3, HKDC1, HMBS, HMGB3, HMGCS2, HN1L, HNF4A, HTRA1, ICAM3, IFI30, IFITM1, IFITM2, IFT172, IKZF1, IL16, IL18RAP, IL1R2, IL1RN, IL2RG, IL32, ILVBL, IMPDH1, INPP5D, IPO5, ITGA2B, ITGAL, ITGAM, ITGAX, ITGB2, ITGB4, JAK3, JUND, JUP, KCNAB2, KCNE1, KIAA1191, KIFC1, KLF1, LAD1, LAMB2, LAMC2, LAPTM5, LCP2, LDHA, LGALS3BP, LGALS4, LGMN, LILRA1, LILRA5, LILRB2, LILRB3, LIMD2, LIPH, LMNB1, LMO7, LPAR2, LRCH4, LRRC25, LSP1, LSR, LST1, LTB, LYL1, MACROD1, MAGED1, MAN2A2, MAP1LC3A, MEFV, MEP1A, MFAP4, MIS18A, MISP, MKNK1, MLKL, MMAB, MME, MMP11, MMP25, MMP7, MMP8, MORN2, MPO, MPP1, MPZL2, MRPL17, MSL3, MSRB1, MTIF2, MUC13, MX2, MXD3, MYL4, MYO1A, MYO1F, MYO1G, MZT2A, NABP1, NADK, NAIP, NAMPT, NARF, NBEAL2, NCF1, NCF2, NCF4, NDEL1, NDUFAF4, NDUFB5, NEK2, NFAM1, NFE2, NKG7, NLRC4, NLRC5, NLRP12, NNMT, NOX1, NQO1, NTMT1, NTPCR, NUPR1, OAZ1, ORM1, OSCAR, P2RX1, PADI2, PADI4, PALLD, PARVG, PDHA1, PDHX, PGD, PGLYRP1, PHC2, PHF21A, PHGR1, PHOSPHO1, PIK3R5, PIN4, PLB1, PLBD1, PLCB2, PLCD3, PLCG2, PLEKHA1, PLS3, POF1B, POLR2H, POSTN, PPBP, PPIL1, PPM1M, PPP1R16A, PPP1R1B, PRAM1, PRAP1, PREX1, PRKCB, PROCR, PROK2, PRR15, PRR15L, PRRC1, PRSS22, PRSS8, PRTN3, PSTPIP1, PTK2B, PTPN6, PYCR1, PYGL, R3HDM4, RAB24, RAB25, RAC2, RARRES2, RASAL3, RASGRP2, RASGRP4, RASSF2, REG4, RELT, REM2, RETN, RFC3, RGL4, RGS19, RHOD, RHPN2, RIN3, RNASE4, RND3, RNF166, RNF167, RPL41, RPS6KA1, RPS9, S100A12, S100A14, S100A8, S1PR4, SAP25, SASH3, SCNN1A, SCOC, SDCBP2, SEC11C, SECTM1, SELL, SEMA4D, SEPP1, SEPTIN10, SFN, SHKBP1, SIGLEC5, SIPA1, SIRPB1, SLC11A1, SLC12A9, SLC16A3, SLC25A37, SLC2A3, SLC2A8, SLC38A5, SLC39A5, SLC43A2, SLCO3A1, SMAP2, SMIM22, SNCA, SORD, SORL1, SPATS2L, SPI1, SRC, ST20, STAP2, STARD10, STAT5B, STEAP1, STK10, STX11, STXBP2, SULT2B1, SYTL3, TACC3, TAGAP, TALDO1, TBC1D10C, TBX21, TBXAS1, TCEAL4, TFF3, THBS2, THEMIS2, TIMM8B, TINAGL1, TJP3, TLR6, TM4SF5, TMBIM6, TMC4, TMCC2, TMEM106C, TMEM126B, TMEM14A, TMEM14C, TMEM71, TMEM91, TMEM97, TMPRSS2, TMPRSS4, TNFRSF10C, TNS3, TOP2A, TPM1, TRAF3IP3, TREM1, TREML2, TRIM22, TRIM25, TSKU, TSPAN15, TSPAN8, TUBA4A, TUBB1, TYMS, TYROBP, UBE2C, UBE2D3, UBE2T, UGT8, UNC13D, UQCC2, URI1, USB1, USH1C, VARS, VASP, VAV1, VMP1, VNN2, VNN3, VPS51, VSTM1, VWA1, WAS, WDR12, WFS1, XPO6, ZAP70, ZDHHC18, ZDHHC19, ZDHHC9, ZNF467, ZWINT and parts thereof.
In some embodiments, the one or more NDR comprises one or more NDR of a gene selected from the group consisting of: SHKBP1, ACSL1, BCAR1, RAB25, PRTN3, LSR, SLC11A1, NLRP12, HMBS, LILRB3, GP9, MX2, RASGRP4, ATG16L2 and combinations thereof. In some embodiments, the one or more NDR comprises one or more NDR of a gene selected from the group consisting of: SLC11A1, NLRP12, PRTN3, HMBS, LILRB3, ACSL1, GP9, MX2, RASGRP4, ATG16L2 and combinations thereof. In some embodiments, the one or more NDR comprises one or more NDR of a gene selected from the group consisting of: SHKBP1, ACSL1, BCAR1, RAB25, PRTN3, LSR and combinations thereof.
In various embodiments, where the method comprises a method of estimating a ctDNA burden or tumor burden associated with a specific cancer type or a specific group of cancers, the NDR comprises one or more, or at least about one, at least about two, at least about three, at least about four, at least about five, at least about six, at least about seven, at least about eight, at least about nine, or at least about ten of the following genes/associated NDRs (e.g. a transcription start site, a promoter, an intron-exon junction and/or an exon-intron junction): ABTB1, ACAP1, ACO1, ACSL1, ADIRF, AGR2, AGR3, AK2, AKNA, ALDH18A1, ANKS4B, ARAP1, ARHGAP25, ARHGAP30, ARHGAP9, ARHGEF16, ARHGEF35, ARID5A, ARRB2, ARSE, ATG16L2, ATP2A2, ATP2C2, ATP5G1, ATP5G3, AXIN2, AZGP1, B3GNT3, BATF2, BCAR1, BCL2L15, BCL2L2, BCL6, BDH1, BDH2, BEST1, BGN, BIN2, BIRC5, BMP4, BOK, BSPRY, C10orf54, C11orf21, C16orf54, C19orf33, C1orf162, C1orf210, C1QTNF5, C3, C5AR2, C6orf203, C8orf59, CA1, CALD1, CAMP, CAPN5, CCL20, CCM2, CCR7, CD177, CD276, CD300LF, CD37, CD44, CD55, CDC42SE1, CDCP1, CDH1, CDH17, CDHR5, CDK4, CDK5RAP2, CDX1, CEACAM1, CEACAM4, CEACAM5, CENPF, CFD, CFTR, CHCHD6, CHID1, CKB, CKMT1B, CLDN7, CLEC4E, CORO1A, COTL1, COX6C, CR1, CRB3, CSF3R, CTGF, CTNND1, CXCR1, CXCR2, DDC, DDR1, DDX10, DENND1C, DMTN, DOK3, DPEP2, ECT2, EEF1E1, EFNA3, EI24, ELF3, EMP1, ENTPD2, ENTPD6, EPCAM, EPHA2, EPS8L3, ERBB2, ERBB3, EVI2B, F3, FAM101A, FAM109A, FAM212B, FAM213B, FAM60A, FAM65B, FAM83E, FAM84A, FBL, FBXL5, FCAR, FCGR2A, FCN1, FERMT1, FFAR2, FGD3, FGFR4, FGFRL1, FGR, FMNL1, FN1, FOLR3, FOXA2, FPR1, FUT2, FUT6, GATA1, GBP3, GGT6, GJB2, GLRX3, GLT1D1, GMFG, GMNN, GNLY, GOLT1A, GPC4, GPR35, GPRC5A, GPX2, GRTP1, GZMH, H2AFY2, HBB, HBD, HBG2, HBM, HBQ1, HCK, HID1, HK3, HKDC1, HMGB3, HMGCS2, HN1L, HNF4A, HTRA1, ICAM3, IFITM1, IFITM2, IFT172, IKZF1, IL1R2, IL1RN, IL32, ILVBL, IPO5, ITGA2B, ITGAM, ITGB4, JUND, JUP, KIAA1191, KIFC1, LAD1, LAMB2, LAMC2, LDHA, LGALS3BP, LGALS4, LGMN, LILRB2, LILRB3, LIMD2, LIPH, LMNB1, LMO7, LRRC25, LSR, LST1, MACROD1, MAGED1, MAP1LC3A, MEP1A, MFAP4, MIS18A, MISP, MKNK1, MMAB, MME, MMP11, MMP25, MMP7, MMP8, MORN2, MPP1, MPZL2, MRPL17, MSRB1, MTIF2, MUC13, MXD3, MYL4, MYO1A, MYO1F, MZT2A, NAMPT, NCF1, NCF2, NCF4, NDUFAF4, NDUFB5, NEK2, NFAM1, NFE2, NKG7, NNMT, NOX1, NQO1, NTMT1, NTPCR, NUPR1, OAZ1, ORM1, OSCAR, P2RX1, PADI4, PALLD, PDHA1, PDHX, PGLYRP1, PHC2, PHGR1, PHOSPHO1, PIK3R5, PIN4, PLCD3, PLEKHA1, PLS3, POF1B, POLR2H, POSTN, PPIL1, PPM1M, PPP1R16A, PPP1R1B, PRAM1, PRAP1, PRKCB, PROCR, PRR15, PRR15L, PRRC1, PRSS22, PRSS8, PRTN3, PSTPIP1, PTPN6, PYCR1, RAB25, RARRES2, RASGRP2, RASGRP4, RASSF2, REG4, RETN, RFC3, RHOD, RHPN2, RIN3, RNASE4, RND3, RPL41, RPS6KA1, RPS9, S100A12, S100A14, S100A8, S1PR4, SCNN1A, SCOC, SDCBP2, SEC11C, SEPP1, SEPTIN10, SFN, SHKBP1, SIGLEC5, SIRPB1, SLC11A1, SLC25A37, SLC2A8, SLC39A5, SMIM22, SNCA, SORD, SPATS2L, SPI1, SRC, STAP2, STARD10, STEAP1, STX11, SULT2B1, SYTL3, TBC1D10C, TCEAL4, TFF3, THBS2, THEMIS2, TIMM8B, TINAGL1, TJP3, TM4SF5, TMBIM6, TMC4, TMEM106C, TMEM126B, TMEM14A, TMEM14C, TMEM71, TMEM91, TMEM97, TMPRSS2, TMPRSS4, TNFRSF10C, TNS3, TOP2A, TPM1, TRAF3IP3, TREM1, TRIM22, TSKU, TSPAN15, TSPAN8, TYMS, TYROBP, UBE2C, UBE2T, UGT8, UQCC2, URI1, USH1C, VARS, VAV1, VMP1, VNN2, VPS51, VSTM1, VWA1, WAS, WDR12, WFS1, XPO6, ZAP70, ZDHHC19, ZDHHC9, ZNF467 and ZWINT.
In some embodiments, where the method comprises a method of estimating a ctDNA burden or tumor burden associated with a specific cancer type or a specific group of cancers, the NDR comprises one or more, or at least about one, at least about two, at least about three, at least about four, at least about five, at least about six, at least about seven, at least about eight, at least about nine, or at least about ten of the following genes/associated NDRs (e.g. a transcription start site, a promoter, an intron-exon junction and/or an exon-intron junction): ACAP1, ACSL1, ADIRF, ANKS4B, ARHGAP30, ARSE, ATP5G3, BCAR1, BCL6, BGN, BIN2, BMP4, C19orf33, C1orf162, C5AR2, CCR7, CD276, CD37, CD44, CDC42SE1, CDH17, CDK5RAP2, CHCHD6, CKB, CLDN7, CLEC4E, CTGF, DDX10, ELF3, ERBB3, F3, FAM101A, FAM65B, FAM84A, FBXL5, FCAR, FCN1, FERMT1, FFAR2, FOLR3, FOXA2, FUT2, GMFG, GPRC5A, GPX2, HBB, HBD, HID1, LAMC2, LDHA, LGALS4, LGMN, LIMD2, LRRC25, LSR, MAGED1, MPZL2, MRPL17, MXD3, MYO1A, NCF1, NCF2, NFE2, OAZ1, PHOSPHO1, PLCD3, POF1B, POSTN, PPP1R16A, PRAP1, PRR15L, PRSS8, PRTN3, RAB25, RASGRP4, RFC3, S100A12, SCOC, SDCBP2, SEPP1, SHKBP1, SLC11A1, SORD, SRC, STAP2, STARD10, STX11, SYTL3, TCEAL4, TFF3, TM4SF5, TMC4, TMEM126B, TMPRSS2, TNFRSF10C, TRAF3IP3, TREM1, TRIM22, TYMS, TYROBP, UBE2C, UGT8, UQCC2, VAV1, VNN2, WAS and ZDHHC9.
In some embodiments, where the method comprises a method of estimating a ctDNA burden or tumor burden associated with a specific cancer type or a specific group of cancers, the NDR comprises one or more, or at least about one, at least about two, at least about three, at least about four, at least about five, at least about six, at least about seven, at least about eight, at least about nine, or at least about ten of the following genes/associated NDRs (e.g. a transcription start site, a promoter, an intron-exon junction and/or an exon-intron junction): ACSL1, ANKS4B, ARHGAP30, ATP5G3, B3GNT3, BCL6, BIN2, BMP4, C19orf33, C1orf162, CD37, CLEC4E, ERBB3, FBXL5, FCAR, FCN1, FERMT1, FFAR2, FOXA2, GMFG, HBB, HID1, ICAM3, LGALS4, LGMN, LSR, MXD3, MYO1A, NCF1, NCF2, NFE2, OAZ1, PHOSPHO1, PLCD3, PRAP1, PRSS8, PRTN3, RAB25, RASGRP4, SCOC, SDCBP2, SEPP1, SHKBP1, SYTL3, TFF3, TM4SF5, TMC4, TMPRSS2, TRAF3IP3, TRIM22, TYROBP, UGT8, VAV1 and WAS.
In some embodiments, where the method comprises a method of estimating a ctDNA burden or tumor burden associated with a specific cancer type or a specific group of cancers, the NDR comprises the following: first exon-intron junction of SHKBP1, first exon-intron junction of ACSL1, first exon-intron junction of BCAR1, promoter of RAB25, promoter of PRTN3 and/or promoter of LSR.
In some embodiments, the method further comprises assigning the most weight to the level of cfDNA that maps to the first exon-intron junction of SHKBP1 and less weight to the level of cfDNA that maps to the other NDR(s) when estimating the ctDNA burden. In some embodiments, the method comprises assigning more weight to the level of cfDNA that maps to the first exon-intron junction of SHKBP1 and relatively less weight to the level of cfDNA that maps to one or more of the following NDRs: the first exon-intron junction of ACSL1, the first exon-intron junction of BCAR1, the promoter of RAB25, the promoter of LSR and the promoter of PRTN3. In some embodiments, the method comprises assigning more weight to the level of cfDNA that maps to the first exon-intron junction of ACSL1 and relatively less weight to the level of cfDNA that maps to one or more of the following NDRs: the first exon-intron junction of BCAR1, the promoter of RAB25, the promoter of LSR and the promoter of PRTN3. In some embodiments, the method comprises assigning more weight to the level of cfDNA that maps to the first exon-intron junction of BCAR1 and relatively less weight to the level of cfDNA that maps to one or more of the following NDRs: the promoter of RAB25, the promoter of LSR and the promoter of PRTN3. In some embodiments, the method comprises assigning more weight to the level of cfDNA that maps to the promoter of RAB25 and relatively less weight to the level of cfDNA that maps to one or more of the following NDRs: the promoter of LSR and the promoter of PRTN3. In various embodiments, the method comprises assigning more weight to the level of cfDNA that maps to the promoter of LSR and relatively less weight to the level of cfDNA that maps to the promoter of PRTN3. In some embodiments, the method comprises assigning more weight to the level of cfDNA that maps to a first exon-intron junction and relatively less weight to the level of cfDNA that maps to a promoter.
In various embodiments, the specific cancer type or specific group of cancers comprises colorectal cancer.
In various embodiments, where the method comprises a method of estimating a ctDNA burden or tumor burden associated with any cancer or cancer in general (pan cancer), the NDR comprises one or more, or at least about one, at least about two, at least about three, at least about four, at least about five, at least about six, at least about seven, at least about eight, at least about nine, or at least about ten of the following genes/associated NDRs (e.g. a transcription start site, a promoter, an intron-exon junction and/or an exon-intron junction): ABHD5, ABTB1, ACAP1, ACRBP, ACSL1, ADAM8, AHSP, AKNA, ALAS2, ALOX5, ANPEP, AOAH, APOBEC3A, ARAP1, ARHGAP26, ARHGAP9, ARID5A, ARRB2, ATG16L2, ATP6V1B2, AZU1, BIN2, BMX, BPI, BTK, BTNL8, C11orf21, C19orf35, C1orf162, C1orf228, C6orf25, CA1, CA4, CAMP, CARS2, CCDC88B, CCND3, CD177, CD244, CD300E, CD300LB, CD37, CD44, CD53, CDK5RAP2, CEACAM3, CEACAM4, CELF2, CFP, CLC, CLEC12A, CLEC4D, CLEC4E, CMTM2, CNN2, CORO1A, CORO7, CR1, CSF3R, CXCR1, CXCR2, CYBA, CYTH4, DEF8, DEFA1, DEFA1B, DEFA3, DEFA4, DENND1C, DENND3, DHRS13, DHX34, DMTN, DNAH17, DOCK2, DOK3, DYSF, ECE1, EGLN2, ELANE, EPB42, FAM49B, FAM65B, FBXL5, FCAR, FCGR2A, FCGR3B, FCN1, FERMT3, FES, FFAR2, FGD3, FGR, FKBP8, FLOT2, FMNL1, FOLR3, FPR2, FUT7, GATA1, GCA, GNG2, GNLY, GP9, GPSM3, GRAMD1A, GRAP2, HBA1, HBA2, HBB, HBD, HBG1, HBG2, HBM, HBQ1, HCK, HCLS1, HK3, HMBS, ICAM3, IFI30, IFITM1, IFITM2, IL16, IL18RAP, IL1R2, IL2RG, IMPDH1, INPP5D, ITGA2B, ITGAL, ITGAM, ITGAX, ITGB2, JAK3, KCNAB2, KCNE1, KLF1, LAPTM5, LCP2, LILRA1, LILRA5, LILRB2, LILRB3, LPAR2, LRCH4, LSP1, LST1, LTB, LYL1, MAN2A2, MEFV, MKNK1, MLKL, MMP25, MMP8, MPO, MPP1, MSL3, MSRB1, MX2, MXD3, MYL4, MYO1F, MYO1G, NABP1, NADK, NAIP, NAMPT, NARF, NBEAL2, NCF1, NCF2, NDEL1, NFE2, NLRC4, NLRC5, NLRP12, P2RX1, PADI2, PADI4, PARVG, PGD, PGLYRP1, PHF21A, PHOSPHO1, PIK3R5, PLB1, PLBD1, PLCB2, PLCG2, PPBP, PRAM1, PREX1, PROK2, PRTN3, PSTPIP1, PTK2B, PTPN6, PYGL, R3HDM4, RAB24, RAC2, RASAL3, RASGRP2, RASGRP4, RELT, 
REM2, RGL4, RGS19, RIN3, RNF166, RNF167, SAP25, SASH3, SECTM1, SELL, SEMA4D, SHKBP1, SIGLEC5, SIPA1, SIRPB1, SLC11A1, SLC12A9, SLC16A3, SLC2A3, SLC38A5, SLC43A2, SLCO3A1, SMAP2, SORL1, SPI1, ST20, STAT5B, STK10, STXBP2, TACC3, TAGAP, TALDO1, TBC1D10C, TBX21, TBXAS1, THEMIS2, TLR6, TMCC2, TMEM71, TNFRSF10C, TRAF3IP3, TREML2, TRIM25, TUBA4A, TUBB1, UBE2D3, UNC13D, USB1, VASP, VAV1, VMP1, VNN3, VSTM1, WAS, XPO6, ZAP70, ZDHHC18 and ZDHHC19.
In some embodiments, where the method comprises a method of estimating a ctDNA burden or tumor burden associated with any cancer or cancer in general (pan cancer), the NDR comprises one or more, or at least about one, at least about two, at least about three, at least about four, at least about five, at least about six, at least about seven, at least about eight, at least about nine, or at least about ten of the following genes/associated NDRs (e.g. a transcription start site, a promoter, an intron-exon junction and/or an exon-intron junction): ABTB1, ACAP1, ACSL1, ARHGAP9, ATG16L2, ATP6V1B2, BIN2, BTK, BTNL8, C19orf35, CA4, CD37, CDK5RAP2, CEACAM4, CFP, CLEC12A, CLEC4D, CLEC4E, CSF3R, CXCR2, CYTH4, DEF8, DENND1C, DENND3, DHRS13, DOK3, FAM49B, FBXL5, FCGR2A, FCN1, FES, FFAR2, FKBP8, FMNL1, FOLR3, FUT7, GNG2, GP9, GPSM3, HBD, HK3, HMBS, IFI30, IL16, IL1R2, ITGA2B, JAK3, KCNE1, LCP2, LILRB2, LILRB3, LYL1, MAN2A2, MKNK1, MLKL, MPO, MX2, MYO1F, NCF1, NFE2, NLRP12, PADI2, PADI4, PARVG, PGLYRP1, PHOSPHO1, PREX1, PRTN3, PSTPIP1, RAC2, RASAL3, RASGRP4, RELT, RNF166, RNF167, SHKBP1, SLC11A1, SLC12A9, SLC16A3, SLCO3A1, SORL1, SPI1, TBC1D10C, TBXAS1, USB1, VAV1, VSTM1, XPO6 and ZDHHC18.
In some embodiments, where the method comprises a method of estimating a ctDNA burden or tumor burden associated with any cancer or cancer in general (pan cancer), the NDR comprises one or more, or at least about one, at least about two, at least about three, at least about four, at least about five, at least about six, at least about seven, at least about eight, at least about nine, or at least about ten of the following genes/associated NDRs (e.g. a transcription start site, a promoter, an intron-exon junction and/or an exon-intron junction): ABTB1, ACAP1, ACSL1, ATG16L2, ATP6V1B2, BTK, BTNL8, C19orf35, CEACAM4, CLEC4E, CSF3R, DENND1C, DENND3, DHRS13, FBXL5, FCAR, FCN1, FFAR2, FKBP8, FMNL1, GNG2, GP9, GPSM3, HBD, HMBS, IFI30, IL18RAP, ITGA2B, LCP2, LILRB3, LYL1, MAN2A2, MKNK1, MPO, MX2, MXD3, MYO1F, NFE2, NLRP12, PADI4, PHOSPHO1, PREX1, RASGRP2, RASGRP4, RGL4, RNF166, RNF167, SHKBP1, SLC11A1, SLC12A9, TBC1D10C, TBXAS1, USB1 and VSTM1.
In some embodiments, where the method comprises a method of estimating a ctDNA burden or tumor burden associated with any cancer or cancer in general (pan cancer), the NDR comprises the following: promoter of SLC11A1, promoter of NLRP12, promoter of PRTN3, promoter of HMBS, promoter of LILRB3, first exon-intron junction of ACSL1, first exon-intron junction of GP9, promoter of MX2, promoter of RASGRP4 and/or promoter of ATG16L2.
In some embodiments, the method further comprises assigning the most weight to the level of cfDNA that maps to the promoter of HMBS and/or the first exon-intron junction of GP9 and relatively less weight to the level of cfDNA that maps to the other NDR(s) when estimating the ctDNA burden. In some embodiments, the method comprises assigning more weight to the level of cfDNA that maps to the promoter of HMBS and/or the first exon-intron junction of GP9 and relatively less weight to the level of cfDNA that maps to one or more of the following NDRs: the promoter of RASGRP4, the promoter of NLRP12, the promoter of ATG16L2, the promoter of SLC11A1, the promoter of LILRB3, the promoter of PRTN3, the promoter of MX2 and the first exon-intron junction of ACSL1. In some embodiments, the method comprises assigning more weight to the level of cfDNA that maps to the promoter of RASGRP4 and relatively less weight to the level of cfDNA that maps to one or more of the following NDRs: the promoter of NLRP12, the promoter of ATG16L2, the promoter of SLC11A1, the promoter of LILRB3, the promoter of PRTN3, the promoter of MX2 and the first exon-intron junction of ACSL1. In some embodiments, the method comprises assigning more weight to the level of cfDNA that maps to the promoter of NLRP12 and relatively less weight to the level of cfDNA that maps to one or more of the following NDRs: the promoter of ATG16L2, the promoter of SLC11A1, the promoter of LILRB3, the promoter of PRTN3, the promoter of MX2 and the first exon-intron junction of ACSL1. In some embodiments, the method comprises assigning more weight to the level of cfDNA that maps to the promoter of ATG16L2 and relatively less weight to the level of cfDNA that maps to one or more of the following NDRs: the promoter of SLC11A1, the promoter of LILRB3, the promoter of PRTN3, the promoter of MX2 and the first exon-intron junction of ACSL1. 
In some embodiments, the method comprises assigning more weight to the level of cfDNA that maps to the promoter of SLC11A1 and relatively less weight to the level of cfDNA that maps to one or more of the following NDRs: the promoter of LILRB3, the promoter of PRTN3, the promoter of MX2 and the first exon-intron junction of ACSL1. In some embodiments, the method comprises assigning more weight to the level of cfDNA that maps to the promoter of LILRB3 and relatively less weight to the level of cfDNA that maps to one or more of the following NDRs: the promoter of PRTN3, the promoter of MX2 and the first exon-intron junction of ACSL1. In some embodiments, the method comprises assigning more weight to the level of cfDNA that maps to the promoter of PRTN3 and relatively less weight to the level of cfDNA that maps to one or more of the following NDRs: the promoter of MX2 and the first exon-intron junction of ACSL1. In some embodiments, the method comprises assigning similar weights to the level of cfDNA that maps to the promoter of HMBS and the level of cfDNA that maps to the first exon-intron junction of GP9. In some embodiments, the method comprises assigning similar weights to the level of cfDNA that maps to the promoter of MX2 and the level of cfDNA that maps to the first exon-intron junction of ACSL1.
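The relative weighting described above can be pictured as a weighted sum of per-NDR cfDNA levels. The following is a minimal sketch only: the numerical weights, the intercept and the function name `weighted_score` are hypothetical placeholders, and only the ordering of the weights follows the text. The sign and scale of real coefficients would come from the trained model, and the "similar" weights are simplified here to equal values.

```python
# Hypothetical weights: only their relative ordering follows the text above.
PAN_CANCER_WEIGHTS = {
    "HMBS promoter": 1.0,
    "GP9 first exon-intron junction": 1.0,   # similar to HMBS promoter
    "RASGRP4 promoter": 0.8,
    "NLRP12 promoter": 0.7,
    "ATG16L2 promoter": 0.6,
    "SLC11A1 promoter": 0.5,
    "LILRB3 promoter": 0.4,
    "PRTN3 promoter": 0.3,
    "MX2 promoter": 0.2,
    "ACSL1 first exon-intron junction": 0.2,  # similar to MX2 promoter
}


def weighted_score(coverage, weights=PAN_CANCER_WEIGHTS, intercept=0.0):
    """Combine per-NDR cfDNA levels (e.g. read depth coverage) into one score."""
    missing = [name for name in weights if name not in coverage]
    if missing:
        raise KeyError(f"no coverage value for: {missing}")
    return intercept + sum(weights[name] * coverage[name] for name in weights)


# Usage with made-up coverage values of 1.0 for every NDR.
example_coverage = {name: 1.0 for name in PAN_CANCER_WEIGHTS}
example_score = weighted_score(example_coverage)
```

A dictionary keyed by NDR name keeps the weight ordering easy to inspect against the text.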
In various embodiments, the total length/size of the one or more NDR is no more than about 100 kilobase pairs (kb), no more than about 90 kb, no more than about 80 kb, no more than about 70 kb, no more than about 60 kb, no more than about 50 kb, no more than about 30 kb, no more than about 20 kb or no more than about 10 kb. In some embodiments, the total length/size of the one or more NDR is no more than about 30 kb. In some embodiments, the total length/size of the one or more NDR is no more than about 25 kb. In some embodiments, the total length/size of the one or more NDR is about 24 kb.
In various embodiments, the method does not comprise sequencing one or more regions that collectively span more than about 100 kb, more than about 95 kb, more than about 90 kb, more than about 85 kb, more than about 80 kb, more than about 75 kb, more than about 70 kb, more than about 65 kb, more than about 60 kb, more than about 55 kb, more than about 50 kb, more than about 45 kb, more than about 40 kb, more than about 35 kb, more than about 30 kb, more than about 25 kb, more than about 20 kb, more than about 15 kb, more than about 10 kb or more than about 5 kb in length. In various embodiments, the method does not comprise sequencing a continuous/contiguous region that spans more than about 4 kb, more than about 5 kb, more than about 6 kb, more than about 7 kb, more than about 8 kb, more than about 9 kb or more than about 10 kb in length. In various embodiments, the method does not comprise whole genome sequencing of the cfDNA. Advantageously, embodiments of the method are efficient in terms of time and resources, and provide a fast turnaround time.
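The size limits above can be checked against a candidate panel with simple arithmetic. The sketch below assumes each targeted region is given as a hypothetical `(name, start, end)` tuple with half-open base-pair coordinates and that the regions do not overlap. The region names and coordinates are invented for illustration, and the default limits of 30 kb in total and 4 kb per contiguous region are just two of the values recited above.

```python
def panel_summary(regions):
    """Return (total length, longest single region) in base pairs.

    regions: list of (name, start, end) tuples, half-open, assumed non-overlapping.
    """
    lengths = [end - start for _, start, end in regions]
    if not lengths or any(length <= 0 for length in lengths):
        raise ValueError("every region needs end > start")
    return sum(lengths), max(lengths)


def within_limits(regions, max_total_bp=30_000, max_contiguous_bp=4_000):
    """Check a panel against a total-size limit and a per-region size limit."""
    total, longest = panel_summary(regions)
    return total <= max_total_bp and longest <= max_contiguous_bp


# Usage with invented coordinates.
example_panel = [("ndr_a", 1_000, 3_000), ("ndr_b", 5_000, 6_500)]
example_total, example_longest = panel_summary(example_panel)
```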
As may be appreciated, by following the teachings herein/carrying out the steps of this disclosure, a person skilled in the art will also be able to identify further genomic regions (including non-coding regions), other than the ones highlighted in this disclosure, that are also predictive of the disease burden, the cancer burden, the tumor burden, the ctDNA burden, the level of ctDNA, an amount of ctDNA, the proportion of ctDNA, the fraction of ctDNA and the ctDNA content. Hence, the genes/regions highlighted in this disclosure are non-exhaustive. Indeed, a person skilled in the art would understand that the genomic regions that may be used to predict the disease burden, the cancer burden, the tumor burden, the ctDNA burden, the level of ctDNA, an amount of ctDNA, the proportion of ctDNA, the fraction of ctDNA and/or the ctDNA content are not limited to the particular gene-encoding regions described herein, and may also include non-coding regions (including regions that are far away from genes e.g. regulatory regions such as enhancers).
In various embodiments, the method further comprises removing particulate blood components from the sample (e.g. a blood sample) to leave behind blood plasma for use in the determining step. In various embodiments, plasma is separated from blood shortly after (e.g. within about 2 hours of) venipuncture. In various embodiments, plasma is separated from blood by centrifugation (e.g. at 10 min×300 g and 10 min×9370 g). In various embodiments, the plasma is stored at low temperature e.g. at −80° C. after separation. In various embodiments, the particulate blood components are selected from the group consisting of red blood cells, white blood cells, platelets and combinations thereof. In various embodiments, the method further comprises extracting/isolating/purifying the cfDNA from the sample/blood plasma.
In various embodiments, the method requires no more than about 20 milliliters, no more than about 19.5 milliliters, no more than about 19 milliliters, no more than about 18.5 milliliters, no more than about 18 milliliters, no more than about 17.5 milliliters, no more than about 17 milliliters, no more than about 16.5 milliliters, no more than about 16 milliliters, no more than about 15.5 milliliters, no more than about 15 milliliters, no more than about 14.5 milliliters, no more than about 14 milliliters, no more than about 13.5 milliliters, no more than about 13 milliliters, no more than about 12.5 milliliters, no more than about 12 milliliters, no more than about 11.5 milliliters, no more than about 11 milliliters, no more than about 10.5 milliliters, no more than about 10 milliliters, no more than about 9.5 milliliters, no more than about 9 milliliters, no more than about 8.5 milliliters, no more than about 8 milliliters, no more than about 7.5 milliliters, no more than about 7 milliliters, no more than about 6.5 milliliters, no more than about 6 milliliters, no more than about 5.5 milliliters, no more than about 5 milliliters, no more than about 4.5 milliliters, no more than about 4 milliliters, no more than about 3.5 milliliters, no more than about 3 milliliters, no more than about 2.5 milliliters, no more than about 2 milliliters, no more than about 1.5 milliliters, no more than about 1 milliliter, no more than about 0.9 milliliters, no more than about 0.8 milliliters, no more than about 0.7 milliliters, no more than about 0.6 milliliters, no more than about 500 microliters, no more than about 450 microliters, no more than about 400 microliters, no more than about 350 microliters or no more than about 300 microliters of sample.
In various embodiments, the method further comprises obtaining the sample from the subject prior to the determining step. In various embodiments, the step of obtaining the sample from the subject is a non-surgical step, a non-invasive step or a minimally invasive step. In various embodiments, the step of obtaining the sample from the subject comprises withdrawing a blood sample from the subject.
In various embodiments, the method is capable of precisely estimating one or more of: a disease burden, a cancer burden, a tumor burden, a ctDNA burden, a level of ctDNA, an amount of ctDNA, a proportion of ctDNA, a fraction of ctDNA and a ctDNA content such that the estimated disease burden, cancer burden, tumor burden, ctDNA burden, level of ctDNA, amount of ctDNA, proportion of ctDNA, fraction of ctDNA and/or ctDNA content has an absolute deviation/absolute error/mean absolute deviation/mean absolute error of no more than about 5.0%, no more than about 4.9%, no more than about 4.8%, no more than about 4.7%, no more than about 4.6%, no more than about 4.5%, no more than about 4.4%, no more than about 4.3%, no more than about 4.2%, no more than about 4.1%, no more than about 4.0%, no more than about 3.9%, no more than about 3.8%, no more than about 3.7%, no more than about 3.6%, no more than about 3.5%, no more than about 3.4%, no more than about 3.3%, no more than about 3.2%, no more than about 3.1%, no more than about 3%, no more than about 2.9%, no more than about 2.8%, no more than about 2.7%, no more than about 2.6%, no more than about 2.5%, no more than about 2.4%, no more than about 2.3%, no more than about 2.2%, no more than about 2.1%, no more than about 2%, no more than about 1.9%, no more than about 1.8%, no more than about 1.7%, no more than about 1.6%, or no more than about 1.5% from a true/expected/measured disease burden, cancer burden, tumor burden, ctDNA burden, level of ctDNA, amount of ctDNA, proportion of ctDNA, fraction of ctDNA and/or ctDNA content.
In various embodiments, the method has a predictive accuracy of at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, at least about 99.1%, at least about 99.2%, at least about 99.3%, at least about 99.4%, at least about 99.5%, at least about 99.6%, at least about 99.7%, at least about 99.8%, at least about 99.9% or about 100%.
In various embodiments, the method comprises a machine learning-based method.
In various embodiments, the method further comprises training a machine learning model with a first training data set defining a level (or an amount, a proportion, a fraction and/or a content) of DNA, optionally cfDNA, that maps to (or aligns with, corresponds to, belongs to, is similar to and/or is identical to) the one or more NDR (e.g. in the form of a read depth coverage) as features and a measured disease burden/tumor burden/ctDNA burden/ctDNA level/ctDNA amount/ctDNA proportion/ctDNA fraction/ctDNA content; and selecting a first set of one or more features that is predictive of the disease burden/tumor burden/ctDNA burden/ctDNA level/ctDNA amount/ctDNA proportion/ctDNA fraction/ctDNA content.
In various embodiments, a subset of samples may be randomly selected from a training data set to train the machine learning model to identify the most predictive features. In various embodiments, the foregoing may be repeated independently multiple times, e.g. 1000 times, and the number of times each feature is chosen as a predictive feature is counted. In various embodiments, the feature(s) that is/are selected most frequently is/are extracted to train a final model comprising all samples in the training data set (e.g. the first training data set) to identify one or more features (e.g. the first set of one or more features) that is predictive of the disease burden/tumor burden/ctDNA burden/ctDNA level/ctDNA amount/ctDNA proportion/ctDNA fraction/ctDNA content. In various embodiments, cross validation (e.g. five-fold cross validation, eight-fold cross validation, ten-fold cross validation) is carried out during the machine learning process for identifying the most predictive features.
In various embodiments, the selecting step further comprises employing a linear model/regression, optionally a sparse linear model/regression, further optionally a Lasso (least absolute shrinkage and selection operator) model to identify the first set of one or more features that is predictive of the disease burden/tumor burden/ctDNA burden/ctDNA level/ctDNA amount/ctDNA proportion/ctDNA fraction/ctDNA content.
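The repeated-subsampling selection with a Lasso model described above may be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of this disclosure: the Lasso is a small pure-Python coordinate-descent solver, the penalty `alpha` is fixed by hand rather than chosen by cross validation, the 80% subsample fraction is an arbitrary choice, and the data are synthetic, with only feature 0 driving the hypothetical measured burden. All function names are hypothetical.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero; values within the penalty become exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def center(X, y):
    """Subtract column means from X and the mean from y (no intercept needed)."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    return Xc, [value - y_mean for value in y]


def lasso_fit(X, y, alpha, n_iter=100):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + alpha*||w||_1 on centered data."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant feature carries no information
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            for i in range(n):
                residual[i] -= delta * X[i][j]
            w[j] = new_w
    return w


def selection_counts(X, y, alpha, n_rounds=1000, fraction=0.8, seed=0):
    """Count how often each feature gets a non-zero coefficient across subsamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    size = max(2, int(fraction * n))
    counts = [0] * p
    for _ in range(n_rounds):
        rows = rng.sample(range(n), size)
        Xc, yc = center([X[i] for i in rows], [y[i] for i in rows])
        for j, coefficient in enumerate(lasso_fit(Xc, yc, alpha)):
            if coefficient != 0.0:
                counts[j] += 1
    return counts


# Synthetic example (50 rounds to keep it fast; the text mentions e.g. 1000).
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 3), data_rng.gauss(0, 1), data_rng.gauss(0, 1)]
     for _ in range(40)]
y = [2.0 * row[0] + data_rng.gauss(0, 0.1) for row in X]
counts = selection_counts(X, y, alpha=0.5, n_rounds=50)
```

Features with the highest counts would then be carried forward to train a final model on all training samples.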
In various embodiments, the method further comprises providing a test data set to the trained machine learning model, the test data set defining at least the first set of one or more selected features; and estimating the disease burden/tumor burden/ctDNA burden/ctDNA level/ctDNA amount/ctDNA proportion/ctDNA fraction/ctDNA content based on at least the first set of one or more selected features.
In various embodiments, the method further comprises comparing the estimated disease burden/tumor burden/ctDNA burden/ctDNA level/ctDNA amount/ctDNA proportion/ctDNA fraction/ctDNA content with a true/expected/measured disease burden/tumor burden/ctDNA burden/ctDNA level/ctDNA amount/ctDNA proportion/ctDNA fraction/ctDNA content of the test data set; and calculating an absolute deviation/absolute error/mean absolute deviation/mean absolute error between the estimated and the true/expected/measured disease burden/tumor burden/ctDNA burden/ctDNA level/ctDNA amount/ctDNA proportion/ctDNA fraction/ctDNA content to evaluate a performance/prediction accuracy of the model.
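The mean absolute error used above to evaluate the model can be computed directly from paired estimated and measured values. The sketch below uses a hypothetical function name and made-up ctDNA fractions.

```python
def mean_absolute_error(estimated, measured):
    """Average of |estimated - measured| over paired samples."""
    if len(estimated) != len(measured) or not estimated:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(e - m) for e, m in zip(estimated, measured)) / len(estimated)


# Usage with made-up ctDNA fractions for two test samples.
example_mae = mean_absolute_error([0.10, 0.20], [0.12, 0.16])
```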
In various embodiments, the method further comprises obtaining/collecting blood samples comprising cfDNA from cancer patients and healthy individuals; measuring a level (or an amount, a proportion, a fraction and/or a content) of DNA that maps to (or aligns with, corresponds to, belongs to, is similar to and/or is identical to) the one or more NDR (e.g. in the form of a read depth coverage) to obtain the features of the first training data set; and measuring a disease burden/tumor burden/ctDNA burden/ctDNA level/ctDNA amount/ctDNA proportion/ctDNA fraction/ctDNA content (e.g. by whole genome sequencing, deep whole genome sequencing etc.) to obtain the disease burden/tumor burden/ctDNA burden/ctDNA level/ctDNA amount/ctDNA proportion/ctDNA fraction/ctDNA content of the first training data set. In various embodiments, the method further comprises determining/measuring an expression of the plurality of genes associated with the NDR in the blood samples.
In various embodiments, the method further comprises obtaining/collecting tumor/tumor biopsy samples from cancer patients; extracting/isolating/purifying nucleic acids from the tumor/tumor biopsy samples; measuring from the nucleic acids a level (or an amount, a proportion, a fraction and/or a content) of DNA that maps to (or aligns with, corresponds to, belongs to, is similar to and/or is identical to) the one or more NDR (e.g. in the form of a read depth coverage); and measuring from the nucleic acids an expression of the genes associated with the one or more NDR in the tumor/tumor biopsy samples.
In various embodiments, the method further comprises comparing said level (or an amount, a proportion, a fraction and/or a content) of DNA (e.g. in the form of a read depth coverage) and/or the expression of the genes between the blood samples and the tumor/tumor biopsy samples; identifying genes that show substantially different levels (or amounts, proportions, fractions and/or contents) of DNA (e.g. substantially different read depth coverages) and/or substantially different expressions between the blood samples and the tumor/tumor biopsy samples; and selecting the level (or an amount, a proportion, a fraction and/or a content) of DNA (e.g. in the form of a read depth coverage) of these identified genes in the blood sample as features to be input in the first training data set.
The method may further comprise removing the first set of one or more features from the first training data set to form a second training data set; and training the machine learning model with the second training data set to select a second set of one or more features that is predictive of the disease burden, the cancer burden, the tumor burden, the ctDNA burden, the level of ctDNA, an amount of ctDNA, the proportion of ctDNA, the fraction of ctDNA and/or the ctDNA content. These steps may be repeated one or more times to obtain a third, fourth, fifth etc. set of one or more features that is predictive of the disease burden, the cancer burden, the tumor burden, the ctDNA burden, the level of ctDNA, an amount of ctDNA, the proportion of ctDNA, the fraction of ctDNA and/or the ctDNA content by removing the second, third, fourth etc. set of one or more features from the second, third, fourth etc. training data set to form a third, fourth, fifth etc. training data set respectively.
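The successive removal and retraining described above may be sketched as a simple loop. In this sketch, `select_features` is a hypothetical callback standing in for training the machine learning model on the data restricted to the remaining features and returning the selected ones. The toy selector in the usage example simply picks the first two names alphabetically.

```python
def successive_feature_sets(feature_names, select_features, n_sets):
    """Repeatedly select a feature set, then remove it before the next round."""
    remaining = list(feature_names)
    feature_sets = []
    for _ in range(n_sets):
        if not remaining:
            break  # nothing left to train on
        chosen = list(select_features(remaining))
        if not chosen:
            break  # the model found no further predictive features
        feature_sets.append(chosen)
        chosen_lookup = set(chosen)
        remaining = [name for name in remaining if name not in chosen_lookup]
    return feature_sets


# Usage with a toy selector (hypothetical stand-in for model training).
example_sets = successive_feature_sets(
    ["d", "a", "c", "b", "e"], lambda names: sorted(names)[:2], n_sets=3
)
```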
In various embodiments, the method further comprises screening for/detecting a tumor-specific mutation in the cfDNA/ctDNA present in the blood sample. Advantageously, embodiments of the method simultaneously allow for profiling of actionable cancer mutations and quantitative estimation of the disease burden, the cancer burden, the tumor burden, the ctDNA burden, the level of ctDNA, an amount of ctDNA, the proportion of ctDNA, the fraction of ctDNA and/or the ctDNA content. The method may be performed in combination with or complementary to existing sequencing-based methods in cancer detection/monitoring.
In various embodiments, the method is an in vitro or ex vivo method.
In various embodiments, the method is a liquid biopsy method.
In various embodiments, the method is a method of determining disease progression in a subject and the method further comprises: determining in a subsequent blood sample obtained from the subject, a level of cfDNA that maps to one or more NDR; estimating the ctDNA burden or tumor burden based on said level of cfDNA; comparing the ctDNA burden or tumor burden estimated from said subsequent blood sample with the ctDNA burden or tumor burden estimated from said blood sample; and optionally identifying the subject as having disease progression if the ctDNA burden or tumor burden estimated from said subsequent blood sample is higher than the ctDNA burden or tumor burden estimated from said blood sample or identifying otherwise if the ctDNA burden or tumor burden estimated from said subsequent blood sample is not higher than the ctDNA burden or tumor burden estimated from said blood sample. In various embodiments, where the ctDNA burden or tumor burden estimated from said subsequent blood sample is lower than the ctDNA burden or tumor burden estimated from said blood sample, the disease is identified to be improving/abating in the subject. In various embodiments, where the ctDNA burden or tumor burden estimated from said subsequent blood sample is substantially the same as the ctDNA burden or tumor burden estimated from said blood sample, the disease is identified to be stable in the subject.
Disease progression in a subject may be indicative of resistance to the current treatment regimen received by the subject. Thus, the method may also be useful for identifying resistance to treatment in a subject. In various embodiments therefore, the method further comprises changing the treatment regimen received by the subject if the subject is identified as having disease progression. Changing the treatment regimen may involve subjecting/exposing the subject to a second therapy that is different from the current or the first therapy. Changing the treatment regimen may involve replacing the current treatment regimen received by the subject with another treatment regimen, or it may involve administering to the subject additional therapies in addition to the current treatment regimen. In some embodiments, where a subject is already receiving combination therapy, changing the treatment regimen may also involve removing one or more therapies from the combination therapy. Examples of treatment regimens/therapies include, but are not limited to, chemotherapy, radiotherapy, gene therapy, hormonal therapy, immunotherapy, surgical therapy, combination therapy, alternative therapy/complementary therapy and combinations thereof. In various embodiments, changing the treatment regimen does not necessarily entail switching from one class of therapy (e.g. one of chemotherapy, radiotherapy, gene therapy, hormonal therapy, immunotherapy, surgical therapy, combination therapy, alternative therapy/complementary therapy and combinations thereof) to another class of therapy, although it may involve such a switch. Changing the treatment regimen may involve changing from one specific therapy to another specific therapy within the same therapy class. For example, changing the treatment regimen may involve changing the particular chemotherapy drug received by the subject.
In various embodiments, there is provided a method of monitoring disease progression in a subject, the method comprising: determining in a first sample comprising cfDNA obtained from the subject at a first time point, a first level (or an amount, a proportion, a fraction and/or a content) of cfDNA that maps to one or more NDR; estimating a first ctDNA burden or tumor burden (or a disease burden, a cancer burden, a level of ctDNA, an amount of ctDNA, a proportion of ctDNA, a fraction of ctDNA and a ctDNA content) in the subject based on the first level of cfDNA; determining in a second sample comprising cfDNA obtained from the subject at a second time point, a second level of cfDNA that maps to the one or more NDR; estimating a second ctDNA burden or tumor burden based on the second level of cfDNA; and comparing the first and the second estimated ctDNA burden or tumor burden to determine whether the disease has progressed, wherein the second time point is later than the first time point. In various embodiments, where the second estimated ctDNA burden or tumor burden is higher than the first estimated ctDNA burden or tumor burden, the disease is considered to have progressed/worsened. In various embodiments, where the second estimated ctDNA burden or tumor burden is lower than or is substantially the same as the first estimated ctDNA burden or tumor burden, the disease is considered to have abated or stabilized.
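The comparison of two time points described above may be sketched as a small decision rule. The text does not define how close two estimates must be to count as "substantially the same", so the `tolerance` parameter below (an absolute difference in ctDNA fraction) is an assumption made purely for illustration, as are the function name and the example values.

```python
def classify_change(first_burden, second_burden, tolerance=0.01):
    """Compare estimated burdens from an earlier and a later sample.

    tolerance is a hypothetical cut-off for "substantially the same".
    """
    difference = second_burden - first_burden
    if difference > tolerance:
        return "progressed"
    if difference < -tolerance:
        return "abated"
    return "stable"


# Usage with made-up ctDNA fractions at two time points.
example_call = classify_change(0.05, 0.12)
```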
In various embodiments, there is provided a method of evaluating treatment efficacy/response in a subject, the method comprising: determining in a first sample comprising cfDNA obtained from the subject before/during a treatment/treatment stage, a first level (or an amount, a proportion, a fraction and/or a content) of cfDNA that maps to one or more NDR; estimating a first ctDNA burden or tumor burden (or a disease burden, a cancer burden, a level of ctDNA, an amount of ctDNA, a proportion of ctDNA, a fraction of ctDNA and a ctDNA content) in the subject based on the first level of cfDNA; determining in a second sample comprising cfDNA obtained from the subject after the treatment/treatment stage, a second level of cfDNA that maps to the one or more NDR; estimating a second ctDNA burden or tumor burden based on the second level of cfDNA; and comparing the first and the second estimated ctDNA burden or tumor burden to determine whether the treatment is effective/the subject is responding to the treatment. In various embodiments, where the second estimated ctDNA burden or tumor burden is higher than the first estimated ctDNA burden or tumor burden, the treatment is considered to be not effective or the subject is considered to be not responding to the treatment. In various embodiments, where the treatment is considered to be not effective or the subject is considered to be not responding to the treatment, the method further comprises adjusting/altering/stopping/halting/discontinuing the treatment regimen. In various embodiments, where the second estimated ctDNA burden or tumor burden is lower than or substantially the same as the first estimated ctDNA burden or tumor burden, the treatment is considered to be effective or the subject is considered to be responding to the treatment. 
In various embodiments, where the treatment is considered to be effective or the subject is considered to be responding to the treatment, the method further comprises continuing the treatment regimen.
In various embodiments, there is provided a method of determining a risk of cancer (e.g. a risk of development, predisposition, progression, relapse, recurrence, metastasis, abatement of cancer) in a subject, the method comprising: determining in a blood sample obtained from the subject, a level (or an amount, a proportion, a fraction and/or a content) of cfDNA that maps to one or more NDR, optionally estimating a disease burden (or a cancer burden, a tumor burden, a ctDNA burden, a level of ctDNA, an amount of ctDNA, a proportion of ctDNA, a fraction of ctDNA or a ctDNA content) based on said level of cfDNA, and determining the risk of cancer based on the level of cfDNA that maps to the one or more NDR, or the estimated disease burden. In various embodiments, where the level of cfDNA/the estimated disease burden exceeds a predetermined threshold level, the subject is concluded to have an elevated risk of cancer. In various embodiments, where the level of cfDNA/the estimated disease burden does not exceed the predetermined threshold level, the subject is concluded to have a reduced/low/minimal/no risk of cancer. It will be appreciated that it is within the purview of a person skilled in the art to determine the suitable threshold level. For example, the suitable threshold level may be determined by determining the mean level of cfDNA/the mean estimated disease burden of a healthy population e.g. a population that does not suffer from cancer.
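As a non-limiting illustration of the threshold determination described above, the following sketch derives a threshold from the mean of a healthy population and applies it to a subject's value. The function names `threshold_from_healthy` and `risk_call` are hypothetical, and using the plain mean (without any added margin) is simply the example given in the text, not a recommended cut-off.

```python
from statistics import mean


def threshold_from_healthy(healthy_values):
    """Return the mean level (or mean estimated burden) of a healthy population."""
    if not healthy_values:
        raise ValueError("at least one healthy reference value is required")
    return mean(healthy_values)


def risk_call(subject_value, threshold):
    """Elevated risk if the subject's value exceeds the threshold, otherwise not."""
    return "elevated" if subject_value > threshold else "not elevated"
```

In practice a skilled person might add a margin above the healthy mean to account for the spread of the healthy population; that choice is outside what is stated here.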
In various embodiments, there is provided a method of treating cancer in a subject, the method comprising: determining in a blood sample obtained from the subject, a level of cfDNA that maps to one or more NDR and estimating a ctDNA burden or tumor burden based on said level of cfDNA. In various embodiments, where the level of cfDNA/the estimated ctDNA burden or tumor burden exceeds a predetermined threshold level, the subject is subjected to treatment selected from the group consisting of chemotherapy, radiotherapy, gene therapy, hormonal therapy, immunotherapy, surgical therapy, combination therapy, alternative therapy/complementary therapy and combinations thereof. It will be appreciated that it is within the purview of a person skilled in the art to determine the suitable threshold level. For example, the suitable threshold level may be determined by determining the mean level of cfDNA/the mean estimated ctDNA burden or tumor burden of a healthy population e.g. a population that does not suffer from cancer.
In various embodiments, there is provided a method of profiling a subject, the method comprising: determining in a blood sample obtained from the subject, a level of cfDNA that maps to one or more NDR and estimating a ctDNA burden or tumor burden based on said level of cfDNA.
In various embodiments, there is provided a kit/panel/probe set/primer set, optionally a kit/panel/probe set/primer set for estimating a tumor burden or ctDNA burden in a subject, the kit/panel/probe set/primer set comprising one or more probe/primer that is capable of hybridizing/binding to one or more NDR, where said NDR (i) comprises the NDR of a gene whose transcript is differentially expressed between healthy blood tissue and tumor tissue and/or (ii) is degraded to different extents between healthy blood tissue and blood tissue of a tumor-bearing subject. In various embodiments, the one or more probe/primer is capable of hybridizing/binding to a central genomic region related to the one or more NDR. The size of the central genomic region may be about 1 kb, about 2 kb, about 3 kb, about 4 kb, about 5 kb, about 6 kb, about 7 kb, about 8 kb, about 9 kb or about 10 kb. In one example, a plurality of probes/primers hybridize/bind to an approximately 4 kb region centred at an NDR. The binding sites of a plurality of probes/primers to a central genomic region may be continuous or discontinuous within the central genomic region. In various embodiments, the one or more probe/primer has a sequence that is complementary to a sequence of the one or more NDR or parts thereof. In various embodiments, the one or more probe/primer has a sequence that is complementary to a sequence sharing at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98% or at least about 99% sequence identity with the one or more NDR or parts thereof. In various embodiments, the one or more probe/primer has a sequence that is complementary to a sequence that differs from the one or more NDR or parts thereof by about one, about two, about three, about four or about five nucleotides/bases.
In various embodiments, the one or more probe/primer has a sequence that is complementary to a central genomic region or parts thereof related to the one or more NDR. In various embodiments, the one or more probe/primer has a sequence that is complementary to a sequence sharing at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98% or at least about 99% sequence identity with the central genomic region or parts thereof. In various embodiments, the one or more probe/primer has a sequence that is complementary to a sequence that differs from the central genomic region or parts thereof by about one, about two, about three, about four or about five nucleotides/bases. In one example, the one or more probe/primer has a sequence that is complementary to an approximately 4 kb region centred at an NDR. A skilled person would be able to determine the suitable conditions that would allow the probe/primer to hybridize to the one or more NDR.
In various embodiments, the one or more NDR comprises one or more NDR of a gene listed in one or more of Table 1, Table 2, Table S3, Table S10, Table S14, Table S15, Table S16, Table S17, Table S18, Table S19, Table S20 and Table S21. In various embodiments, the one or more NDR comprises one or more NDR of a gene selected from the group consisting of: SLC11A1, NLRP12, PRTN3, HMBS, LILRB3, ACSL1, GP9, MX2, RASGRP4, ATG16L2 and combinations thereof. In various embodiments, the one or more NDR comprises one or more NDR of a gene selected from the group consisting of: SHKBP1, ACSL1, BCAR1, RAB25, PRTN3, LSR and combinations thereof.
In various embodiments, the kit/panel or the probe set/primer set further comprises a probe/primer for detecting a tumor-specific mutation.
In various embodiments, the one or more probe/primer comprises from about 50 to about 200 nucleotides/bases, from about 90 to about 150 nucleotides/bases or from about 110 to about 130 nucleotides/bases. In various embodiments, the one or more probe/primer comprises no more than about 200, no more than about 190, no more than about 180, no more than about 170, no more than about 160, no more than about 150, no more than about 140, no more than about 130 or no more than about 120 nucleotides/bases. In various embodiments, the one or more probe/primer comprises at least about 50, at least about 60, at least about 70, at least about 80, at least about 90, at least about 100, at least about 110 or at least about 120 nucleotides/bases. In various embodiments, the one or more probe/primer comprises about 120 nucleotides/bases.
In various embodiments, the one or more probe/primer comprises the sequence of one or more of SEQ ID NO: 1 to SEQ ID NO: 577 (i.e. SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, and so forth up to SEQ ID NO: 577, see Supplementary Data 3) or a sequence sharing at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98% or at least about 99% sequence identity thereto. In various embodiments, the sequence sharing at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95% or at least about 99% sequence identity with any one of SEQ ID NO: 1 to SEQ ID NO: 577 is capable of hybridizing/binding to the one or more NDR.
In various embodiments, the kit/panel or the probe set/primer set comprises a plurality of probes/primers.
In various embodiments, the kit/panel/primer set/probe set is for estimating a tumor burden or ctDNA burden associated with cancer, optionally colorectal cancer. In various embodiments, the one or more probe/primer is capable of hybridizing/binding to a genomic region of one or more of the following genes: ARID1A, CCNE1, CDH1, CDK6, CTNNB1, EGFR, ERBB2, KRAS, MUC6, MYC, RHOA, RNF43, SMAD4, TP53 or parts thereof. In various embodiments, the one or more probe/primer has a sequence that is complementary to a genomic region of one or more of ARID1A, CCNE1, CDH1, CDK6, CTNNB1, EGFR, ERBB2, KRAS, MUC6, MYC, RHOA, RNF43, SMAD4, TP53 or parts thereof. In various embodiments, the one or more probe/primer has a sequence that is complementary to a sequence sharing at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98% or at least about 99% sequence identity with the genomic region of one or more of ARID1A, CCNE1, CDH1, CDK6, CTNNB1, EGFR, ERBB2, KRAS, MUC6, MYC, RHOA, RNF43, SMAD4, TP53 or parts thereof. In various embodiments, the one or more probe/primer has a sequence that is complementary to a sequence that differs from the genomic region of one or more of ARID1A, CCNE1, CDH1, CDK6, CTNNB1, EGFR, ERBB2, KRAS, MUC6, MYC, RHOA, RNF43, SMAD4, TP53 or parts thereof by about one, about two, about three, about four or about five nucleotides.
In various embodiments, the one or more probes/primers cover at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99% or about 100% of the target NDR(s)/genomic region(s). In some embodiments, the one or more probes/primers do not overlap each other, i.e. the probes/primers are aligned side-by-side when hybridized/bound to the target NDR(s)/genomic region(s). In some embodiments, there is some degree of overlap among adjacent probes/primers (e.g. an overlap of 10 bp, 30 bp, 50 bp, 70 bp, 90 bp etc.).
The number of probes/primers may vary depending on the number of target NDRs/genomic regions, the length/size of the target NDRs/genomic regions and/or the length/size of the probes/primers etc. Higher probe numbers/density may lead to better sampling, although it can also increase the cost of the method. In various embodiments, the number of probes/primers is in the range of from about 25 to about 50, from about 60 to about 80, from about 90 to about 110, from about 125 to about 150, from about 160 to about 180, from about 190 to about 210, from about 225 to about 250, from about 260 to about 280, from about 290 to about 310, from about 325 to about 350, from about 365 to about 390, from about 405 to about 430, from about 445 to about 470, from about 485 to about 510, from about 525 to about 550, or from about 565 to about 590. In various embodiments, the number of probes/primers is at least about 10, at least about 20, at least about 30, at least about 40, at least about 50, at least about 75, at least about 100, at least about 125, at least about 150, at least about 175, at least about 200, at least about 225, at least about 250, at least about 275 or at least about 300. In various embodiments, the number of probes/primers is no more than about 400, no more than about 375, no more than about 350, no more than about 325, no more than about 300, no more than about 275, no more than about 250, no more than about 225 or no more than about 200.
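To illustrate how the number of probes depends on region size, probe length and overlap, the following minimal sketch tiles probes across a single target region. The function `tile_probes`, its half-open coordinate convention and the rule of adding one final probe flush with the region end are assumptions made for this example; they do not describe how the probes of SEQ ID NO: 1 to SEQ ID NO: 577 were designed.

```python
def tile_probes(region_start, region_end, probe_len=120, overlap=0):
    """Return half-open (start, end) probe intervals tiling [region_start, region_end).

    Adjacent probes share `overlap` bases. If the regular tiling leaves an
    uncovered tail, one extra probe is placed flush with the region end
    (this last probe may overlap its neighbour by more than `overlap`).
    """
    if probe_len <= 0 or not 0 <= overlap < probe_len:
        raise ValueError("need probe_len > 0 and 0 <= overlap < probe_len")
    if region_end - region_start < probe_len:
        raise ValueError("region is shorter than one probe")
    step = probe_len - overlap
    probes = []
    start = region_start
    while start + probe_len <= region_end:
        probes.append((start, start + probe_len))
        start += step
    if probes[-1][1] < region_end:
        probes.append((region_end - probe_len, region_end))
    return probes
```

Under these assumptions, increasing the overlap raises the probe count for the same region, which is the cost/sampling trade-off noted above.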
In various embodiments, there is provided a method or a product as described herein.
Example embodiments of the disclosure will be better understood and readily apparent to one of ordinary skill in the art from the following discussions and if applicable, in conjunction with the figures. It will be appreciated that the example embodiments are illustrative, and that various modifications may be made without deviating from the scope of the invention. Example embodiments are not necessarily mutually exclusive as some may be combined with one or more embodiments to form new exemplary embodiments.
It is shown that the size distribution of cfDNA fragments has a mode of ˜166 bp, suggesting that nucleosome-bound DNA fragments are protected/preserved during cell death and shed into the circulation. Nucleosome depleted regions (NDRs) are therefore more frequently degraded, yielding a nucleosome-dependent degradation footprint in cfDNA profiles, which can be used to infer tissue of origin. The read depth coverage from sequencing plasma cfDNA is shown to be able to identify nucleosome depletion at a gene's promoter and thus infer gene expression. The coverage of the nucleosome-depleted region at a gene's promoter is negatively correlated with the gene's expression level: a highly expressed gene will tend to have less nucleosome binding across its promoter and therefore lower levels of protection and higher levels of DNA degradation. Moreover, plasma cfDNA degradation patterns in cancer patients can be used to infer tumor gene expression.
Here, it is hypothesized that a limited set of tumor or blood-specific NDRs could be used to infer the ctDNA burden (fraction) in the blood circulation of cancer patients. ctDNA burden refers to the relative amount of ctDNA out of all cfDNA molecules in a plasma sample. Using deep cfDNA WGS data from cancer patients and healthy individuals, a quantitative model that infers the ctDNA burden using cfDNA sequencing data from a limited set of NDRs is trained and tested. This model is shown to be accurate for plasma samples from both colorectal cancer (CRC) and breast cancer (BRCA) patients (mean absolute error 4.3%), and deployment is explored using a compact targeted sequencing assay for low-cost and quantitative tracking of patient ctDNA dynamics.
The examples demonstrate two components. The first component is a method for estimating ctDNA burden specifically in liquid biopsies from colorectal cancer (CRC) patients. The second component is a method for estimating ctDNA burden in liquid biopsies from any solid tumor (pan-cancer). Both colorectal cancer and pan-cancer models have high prediction accuracy, but the pan-cancer model has the added advantage that it can be applied to any solid tumor.
In one example, the colorectal cancer ctDNA burden estimation model is built as follows. Machine learning was used to develop a predictive model that uses cfDNA coverage patterns at the promoter and junction regions of selected genes to infer ctDNA burden in the blood samples of colorectal cancer patients. The model was trained using data from an in silico “dilution” of 8 samples from 5 cancer patients with data from healthy individuals, resulting in a training set of 231 “virtual” samples of various ctDNA content (see Table S2). The candidate tumor/blood transcripts that showed both differential expression signal and differential DNA degradation signal at NDRs between CRC tumor and blood were shortlisted. The tumor and blood transcripts were pooled together and their promoter and junction NDR coverage scores were defined as input “features” (908 in total; see Table S3). The coverage value of each position was normalized by the mean coverage of the upstream (−2000 to −1000 bp) and downstream (+1000 to +2000 bp) regions with respect to the transcription start site (for promoter) and the exon boundary (for junction) respectively. A Lasso (least absolute shrinkage and selection operator) model was employed to identify features predictive of ctDNA proportions. Half of the training data was extracted randomly to run Lasso (using 1000 repetitions), consequently discovering 6 stable features (probability ≥0.99) from this stability-based exploration (
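The flank normalization described in this example may be sketched as follows. This is a minimal illustration under stated assumptions: the function name `normalize_coverage` is hypothetical, coverage is assumed to be supplied as a per-base list of read depths, and the exact treatment of the window endpoints (half-open slices here) is not specified in the text.

```python
from statistics import mean


def normalize_coverage(coverage, center_index):
    """Divide per-base coverage by the mean coverage of the two flanking windows.

    coverage:      list of per-base read depths around one site.
    center_index:  index of the transcription start site (promoter feature)
                   or of the exon boundary (junction feature).
    Flanks are -2000..-1000 bp and +1000..+2000 bp relative to the center.
    Because both flanks are pooled, strand orientation does not change the result.
    """
    if center_index < 2000 or center_index + 2000 > len(coverage):
        raise ValueError("need at least 2000 bp of coverage on each side")
    upstream = coverage[center_index - 2000:center_index - 1000]
    downstream = coverage[center_index + 1000:center_index + 2000]
    flank_mean = mean(upstream + downstream)
    if flank_mean == 0:
        raise ValueError("flanking regions have zero coverage")
    return [depth / flank_mean for depth in coverage]
```

With this normalization, a value below 1 at the center indicates lower relative coverage than the flanks, consistent with nucleosome depletion at that site.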
The model may also be applicable to other cancer types, subtypes, or specific therapeutic settings, considering that the tissue-of-origin of cfDNA molecules can in principle be inferred from tissue-specific DNA degradation patterns. Compared with plasma from healthy people, the tumor-derived DNA component in cancer plasma samples weakens the blood-specific DNA degradation pattern, which suggests that the decay of the blood-specific signal might be informative for robustly estimating the ctDNA content regardless of cancer type. Therefore, the ctDNA content estimation method is also extended to the pan-cancer level.
In one example, a pan-cancer ctDNA burden estimation model is built as follows. This pan-cancer model relates to a quantitative method that only uses blood-based features/regions (and no tumor type-specific regions). First, blood transcripts that are highly expressed in blood and lowly expressed in tumors of all 20 cancer types (BLCA, BRCA, CESC, CRC, ESCA, GBM, HNSC, KIRC, KIRP, LGG, LIHC, LUAD, LUSC, OV, PAAD, PRAD, SKCM, STAD, THCA, and UCEC) were shortlisted. These yielded 792 promoter and junction NDR candidate coverage features (Table S10). 215 in-silico samples diluted from the plasma samples of 7 breast cancer (BRCA) patients were added into the existing training set of colorectal cancer samples, and 93 in-silico samples diluted from the plasma samples of 3 BRCA patients were added into the existing test set (see Table S2). The same protocol of feature selection and model training that was used for colorectal cancer ctDNA burden estimation above was performed. It was found that around 10 blood features are sufficient to predict the ctDNA content in plasma samples (see
Based on the guidance provided by this disclosure, users can follow the methodology details to reproduce the work or apply the method to their own data with full flexibility to tune the number of features for their model, as long as the selection can achieve high prediction accuracy and prevent data over-interpretation. As described in the examples herein, users can check the error evolution with the number of top features to determine a reasonable range of numbers of features.
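One possible way to read a reasonable feature count off such an error curve is sketched below. The selection rule (the smallest number of features whose error lies within a tolerance of the best error), the function name `choose_feature_count` and the default `tolerance` are assumptions introduced for illustration only; the disclosure leaves this choice to the user.

```python
def choose_feature_count(errors_by_k, tolerance=0.005):
    """Pick the smallest feature count whose error is close to the best error.

    errors_by_k: dict mapping number of top features -> validation error
                 (e.g. mean absolute error expressed as a fraction).
    tolerance:   hypothetical slack; errors within `tolerance` of the minimum
                 are treated as equally good, favouring the smaller model.
    """
    if not errors_by_k:
        raise ValueError("no error measurements supplied")
    best_error = min(errors_by_k.values())
    return min(k for k, err in errors_by_k.items() if err <= best_error + tolerance)
```

Preferring the smaller model among near-equivalent candidates is one simple guard against data over-interpretation; other criteria could equally be used.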
Embodiments of a machine learning model based on expression-specific DNA degradation patterns to predict ctDNA fractions for potentially clinical use are described herein (
Blood samples (n=29) were collected from healthy individuals and plasma cfDNA was extracted for paired-end WGS (merged ˜150× coverage) (
Association of Gene Expression and cfDNA Fragmentation Patterns
Analysis of cfDNA from the healthy individuals revealed nucleosome depletion and reduced cfDNA protection flanked by a series of strongly positioned nucleosomes at gene promoter regions (
To further explore the hypothesis that NDR cfDNA coverage in plasma samples from cancer patients is associated with the epigenetic state of tumor cells, a targeted sequencing panel was first used to screen plasma samples from CRC patients for cases of high ctDNA burden (VAF >15% for known cancer driver mutations,
Quantitative Estimation of Colorectal Cancer ctDNA Burden
With the insight that cfDNA coverage at NDRs is associated with the transcriptional state of DNA in the tumor cells, it was hypothesized that cfDNA coverage at a small set of NDRs could be used to infer the ctDNA burden (fraction of tumor DNA out of all cfDNA) in the blood plasma of a cancer patient. As training data, 8 deep WGS samples from 5 CRC patients were in silico “diluted” with data from healthy individuals, resulting in a training set of 231 samples of ctDNA proportions ranging from 0.5% up to the original undiluted fractions (
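The arithmetic behind an in silico dilution may be sketched as follows. This is an assumption-laden illustration rather than the procedure actually used: it supposes that a diluted sample is formed by pooling sequencing reads from a cancer plasma sample of known ctDNA fraction with reads from healthy plasma, and the function name `dilution_read_counts` is hypothetical.

```python
def dilution_read_counts(original_fraction, target_fraction, total_reads):
    """Number of reads to draw from the cancer and healthy samples.

    If a share p of the pooled reads comes from a cancer sample whose ctDNA
    fraction is original_fraction, the pooled ctDNA fraction is
    p * original_fraction, so p = target_fraction / original_fraction.
    Returns (reads_from_cancer_sample, reads_from_healthy_sample).
    """
    if not 0 < original_fraction <= 1:
        raise ValueError("original_fraction must be in (0, 1]")
    if not 0 <= target_fraction <= original_fraction:
        raise ValueError("target_fraction cannot exceed original_fraction")
    cancer_share = target_fraction / original_fraction
    n_cancer = round(cancer_share * total_reads)
    return n_cancer, total_reads - n_cancer
```

The upper bound on `target_fraction` reflects the statement above that diluted proportions range up to the original undiluted fraction.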
To further evaluate the robustness of the model when tested on in silico samples generated using healthy samples not seen during model training, the healthy samples (n=29) were split into two different groups to separately generate in silico training and test data. Reassuringly, this analysis showed robust model performance in the presence of independent train/test healthy samples (
Next, the predictive performance of the model was compared with ichorCNA (Adalsteinsson V A, et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications 8, 1324 (2017)), a method that estimates the ctDNA fraction on the basis of arm-level copy number alterations in low-pass WGS data. Overall, ichorCNA generally predicted comparable estimates of ctDNA burden (
Apart from the 6 robust features identified, there may exist other predictive features that correlate with ctDNA burden. A step-wise search with Lasso regression on all 344 in silico samples was performed and the top 6 stable features in each step were extracted to estimate ctDNA fractions. The search was repeated 100 independent times, followed by pooling all predictive features with a deviation threshold of 3% (
Targeted NDR Assay to Estimate ctDNA Burden
Intriguingly, since the predictive models used data from only a few NDRs, it was hypothesized that a targeted sequencing approach could be deployed for robust and low-cost estimation of ctDNA burden. The CRC model only requires cfDNA relative coverages at 6 NDRs (Table 1). The inventors therefore designed capture probes for these 6 regions (total ˜24 kb) and performed targeted sequencing (˜300×) on 53 new plasma samples from CRC patients (
NDR-Based Monitoring of ctDNA Dynamics and Disease Progression
To further explore how NDR-based ctDNA burden estimation could be used for low-cost monitoring of cancer progression, the targeted NDR assay was applied to serial plasma samples collected from five CRC patients (
It was noted that, for a number of plasma samples, the NDR-based ctDNA burden was inferred to be positive, yet the variant calling pipeline identified no SNVs under default settings. To further understand this discordance, the raw sequencing data in these “mutation-free” plasma samples was manually inspected. Indeed, when searching for variants that were identified in other samples/timepoints from the same patients, the raw sequencing data supported presence of the expected SNVs in all the samples with positive NDR-quantified ctDNA burden (Table S9). In contrast, one plasma sample (patient C1531, day 191) was quantified with zero ctDNA burden by the NDR approach and manual screening confirmed absence of TP53 and APC mutations in this sample (Table S9). Overall, these results highlight the robustness of the targeted NDR assay for ctDNA burden estimation.
It was next explored how ctDNA burden dynamics correlate with response to targeted or cytotoxic treatments. Patient C357 was treated with Regorafenib (days 821-842 after diagnosis) followed by Trifluridine (days 979-1026). However, ctDNA burden estimation in this time interval (days 800-1056) showed no drop in ctDNA burden following either treatment, indicating tumor resistance to both drugs; end-treatment CT scans (days 916 and 1056 respectively) confirmed progressive disease. In contrast, patients with positive response to treatment showed a marked reduction of ctDNA burden in plasma. For example, patient C1531 received the chemotherapy regimen of FOLFOXIRI (days 82-175) and had on-treatment and post-treatment CT scans showing partial response. Strikingly, this patient showed a concomitant and marked drop in ctDNA burden both during (day 160) and after (day 191) treatment. In patient C575, TP53 and ATR mutations were only identified at two out of four timepoints by the pipeline. In this patient, both CT scans and ctDNA burden estimation inferred stable disease during the first round of XELOX/bevacizumab treatment (days 612-833). However, during the second round of treatment, the ctDNA burden increased (day 864) and CT scans confirmed progressive disease, indicating acquired drug resistance. Finally, discrepancies between tumor dynamics inferred from CT imaging and ctDNA burden have previously been reported. Patient C519 reflected one such example, where CT scans indicated progressive disease while both ctDNA burden estimates and mutation VAFs decreased.
Estimation of ctDNA Burden Across Cancer Types
The predictive model for CRC ctDNA burden included 3 (out of 6) NDR coverage features from genes overexpressed in whole blood. Intriguingly, a predictive model completely restricted to blood-specific genes could hypothetically quantify the extent that a cfDNA profile deviates from a healthy baseline profile, allowing prediction of ctDNA burden across different cancer types. Indeed, the inventors were able to identify genes overexpressed in whole blood compared to solid tumor tissue that also had decreased NDR coverage in plasma samples from healthy individuals as compared to patients of distinct cancer types (
A model fitted with the training data using the top 10 predictive features (Table 2) had a mean absolute error of 2.2%, with comparable accuracy in CRC and BRCA samples (
Lasso regression with all 792 blood features was employed to identify all potential predictive combinations of pan-cancer features. A step-wise extensive search was carried out on all the 652 in silico samples (see Table S2), and the top 10 features in each step were extracted to estimate ctDNA content (
An additional search for predictive feature combinations for both the CRC and pan-cancer models was performed. While this search was implemented as previously described (
Monitoring of ctDNA offers a non-invasive approach to tracking disease progression and has been demonstrated as a valuable real-time tool for assessing therapeutic response. Here, it is shown that cfDNA coverage patterns at tumor and blood-specific NDRs can be used for quantitative estimation of the ctDNA burden in blood plasma samples. While SNV VAFs can be used as a proxy for the ctDNA burden, this only works for the subset of patients with known and measured clonal SNVs in a given targeted gene panel. SNV-based approximation of ctDNA burden may be further challenged by clonal haematopoiesis, which is frequently observed in cancer patients. Additionally, absolute ctDNA fraction estimation from SNVs requires co-estimation of allele zygosity and clonality, which may be challenging to infer for metastatic patients with multiple independently evolving tumors contributing ctDNA to the blood circulation. Furthermore, in low ctDNA burden samples, which are common and clinically important, NDR-based burden estimation showed improved accuracy as compared to an lp-WGS-based estimation method. In contrast to lp-WGS and DNA methylation-based profiling, NDR-based estimation is directly compatible with targeted gene panel sequencing. Since the ctDNA burden estimation model requires data from 10 or fewer NDRs, these regions can be profiled at low cost by capturing <25 kb of genomic sequence. Targeted cfDNA assays often cover hundreds of genes and >1 Mb captured genomic sequence, with larger panels required for profiling across cancer types and tumour mutation burden estimation. It would be straightforward to co-profile NDRs in such assays, with only a minor increment in panel size. Furthermore, down-sampling analysis showed that the NDR approach is robust down to 100× sequence coverage (
Nucleosome positioning across gene bodies, and its association with transcriptional activity, has been studied using both biochemical assays and cfDNA profiles. Unexpectedly, the systematic analysis across ordered exon-intron junctions revealed that, in addition to the promoter, only the first exon-intron junction showed signatures of strong nucleosome and expression-dependent cfDNA degradation (
In summary, the inventors have shown how tissue and expression-specific cfDNA degradation at NDRs can be used to quantitatively estimate ctDNA burden in blood samples. The approach is directly compatible with targeted gene sequencing, allowing for low-cost and simultaneous discovery of actionable cancer mutations and accurate estimation of ctDNA burden. It is anticipated that next-generation cfDNA assays based on these findings will be useful for quantitatively tracking and analysing cancer disease progression across time and patients.
Cancer patient and healthy volunteer samples were collected under studies 2013/110/B (now 2018/2795) and 2012/733/B approved by the Singhealth Centralised Institutional Review Board. Plasma was separated from blood within 2 hours of venipuncture via centrifugation at 300×g for 10 min and then at 9730×g for 10 min, and then stored at −80° C. DNA was extracted from plasma using the QIAamp Circulating Nucleic Acid Kit following manufacturer's instructions. Sequencing libraries were made using the KAPA HyperPrep kit (Kapa Biosystems, now Roche) following manufacturer's instructions and paired-end sequenced (2×151 bp) on either an Illumina HiSeq 4000 or HiSeq X.
A targeted sequencing panel (Table S7) was first used to screen plasma samples from CRC patients and 12 samples (Table S1) of likely high ctDNA burden were selected, having maximum VAF >15% for known CRC cancer driver mutations (Supplementary Data 1). Similarly, 10 BRCA plasma samples of high ctDNA burden were selected, with either VAF >15% based on a panel of 77 genes (Table S12) of common breast cancer mutations (Supplementary Data 2), or alternatively, significant proportions (>20%) of short (length <150 bp) cfDNA fragments (Table S1). It has been reported that short cfDNA fragments below 150 bp are enriched in high-ctDNA plasma samples. Deep WGS (˜90×) was performed on the 12 cfDNA samples from 7 CRC patients and 10 cfDNA samples from 10 BRCA patients (Table S1). For the 5 CRC patients with 2 samples each, there was an interval of at least 12 months between the two samples. BWA-MEM (Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997, (2013)) was used to align the WGS reads from healthy (n=29, ˜5× coverage), cancer (CRC n=12, BRCA n=10, ˜90× coverage), and germline samples (CRC n=12, ˜30× coverage, not available for BRCA) to the hg19 human genome. Duplicates were marked using biobambam (Tischler G, Leonard S. biobambam: tools for read pair collation based algorithms on BAM files. Source Code for Biology and Medicine 9, 13 (2014)). It has been found that trimming reads from both ends increased the coverage signal of nucleosome positioning. Similarly, the original reads (˜151 bp) were trimmed from the two ends and the central 61 bp was preserved to amplify the nucleosome-associated DNA degradation signal. BAM files of healthy individuals were merged using the SAMtools merge function (Li H, et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078-2079 (2009)). Low-pass WGS (˜4×) was performed on 53 cfDNA samples from 23 CRC patients (Table S6).
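The read trimming step may be illustrated by the following minimal sketch, which computes the genomic interval of the central 61 bp of an aligned read. The function name `central_window`, the half-open coordinates, the handling of an odd remainder (the extra base is trimmed from the right end) and the choice to skip reads shorter than 61 bp are assumptions made for this example and are not specified in the text.

```python
def central_window(read_start, read_length, keep=61):
    """Return the half-open interval [start, end) of the central `keep` bases.

    read_start:  leftmost aligned position of the read.
    read_length: aligned length of the read (e.g. about 151 bp).
    Reads shorter than `keep` return None (assumed to be skipped).
    """
    if read_length < keep:
        return None
    # Trim an equal number of bases from each end; with an odd remainder the
    # extra base is removed from the right end.
    left_trim = (read_length - keep) // 2
    start = read_start + left_trim
    return start, start + keep
```

Counting only these central windows when computing per-base coverage is one way to obtain the sharpened nucleosome-associated signal described above.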
Plasma and patient-matched buffy coat samples were isolated from whole blood within six hours from collection and stored at −80° C. DNA was extracted with the QIAamp Circulating Nucleic Acid Kit, followed by library preparation using the KAPA HyperPrep kit. All libraries were tagged with custom dual indexes containing a random 8-mer unique molecular identifier. Targeted capture was performed on xGen custom panels (Integrated DNA Technologies) relevant to the experiment: a) panel of 100 genes selected based on literature review for relevance to colorectal and breast cancer, see Table S7, or b) capture probes (Supplementary Data 3) targeting genomic regions (4 kb centered at the sites in Table 1) related to the 6 NDRs predictive of ctDNA content in colorectal cancer. Paired-end sequencing (2×151 bp) was done on an Illumina HiSeq 4000 machine.
Sequencing data was analyzed using the bcbio-nextgen pipeline (Guimera R V. bcbio-nextgen: Automated, distributed next-gen sequencing pipeline. EMBnet.journal 17, 30 (2012)), including read alignment with BWA-MEM, PCR duplicate marking with biobambam, as well as recalibration and realignment with GATK (DePristo M A, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics 43, 491 (2011)). Somatic variant calling was performed using MuTect (Cibulskis K, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature biotechnology 31, 213 (2013)) and VarScan (Koboldt D C, et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25, 2283-2285 (2009)) with default parameters, and all calls were annotated with Variant Effect Predictor (McLaren W, et al. The ensembl variant effect predictor. Genome biology 17, 122 (2016)). Variants outside coding regions were removed. For a variant called by both callers, the inferred VAF was the mean of the two VAFs; for a variant called by only one caller, the VAF from that caller was used (Table S8). Variants from HLA-A, KMT2C and MUC17 were filtered out because the majority of variants in these genes were also found by at least one caller at ≥0.005 VAF in buffy coat sequencing.
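As a non-limiting sketch, the rule for combining the VAFs reported by the two callers may be expressed as below. The function name `consensus_vaf` and the use of `None` to represent a variant missed by a caller are assumptions of this example.

```python
from typing import Optional


def consensus_vaf(mutect_vaf: Optional[float], varscan_vaf: Optional[float]) -> Optional[float]:
    """Combine per-caller VAFs: mean if both callers report, else the single reported value.

    Hypothetical helper; None stands for "variant not called by this caller".
    """
    called = [v for v in (mutect_vaf, varscan_vaf) if v is not None]
    if not called:
        return None  # variant reported by neither caller
    return sum(called) / len(called)


print(consensus_vaf(0.25, 0.75))  # both callers: mean of the two VAFs
print(consensus_vaf(None, 0.3))   # one caller only: that caller's VAF
```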
Tissue-specific RNA-seq transcript expression data was obtained from GTEx (including 337 whole blood samples; Table S14). Tumor RNA-seq transcript expression was obtained from TCGA (Table S14). Because a gene usually comprises multiple alternative transcripts with different genomic positions, gene expression was studied at the transcript level for a precise mapping of promoter and junction locations. Transcripts of all coding genes were grouped on the basis of their expression level (fpkm) in whole blood. If a group (e.g. 0.1<fpkm≤1; 25155 transcripts) had more than 5000 transcripts, 5000 transcripts were randomly selected to represent the group. Unexpressed transcripts were defined as transcripts that were not expressed in 99% of all 7861 GTEx samples.
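The group-wise subsampling described above can be sketched, purely as an example, in the following way. The function name `subsample_groups`, the dictionary input format, and the fixed random seed are assumptions introduced for illustration.

```python
import random
from collections import defaultdict


def subsample_groups(transcript_group, max_per_group=5000, seed=0):
    """Cap each expression group at `max_per_group` randomly chosen transcripts.

    transcript_group: dict mapping transcript ID -> expression group label.
    Hypothetical helper; the seed is fixed only to make the example repeatable.
    """
    groups = defaultdict(list)
    for transcript_id, label in transcript_group.items():
        groups[label].append(transcript_id)

    rng = random.Random(seed)
    sampled = {}
    for label, ids in groups.items():
        if len(ids) > max_per_group:
            sampled[label] = rng.sample(ids, max_per_group)  # sampling without replacement
        else:
            sampled[label] = list(ids)  # small groups are kept whole
    return sampled


# Toy input: 10 transcripts in one group, 3 in another, capped at 5 per group.
toy = {f"tx{i}": "0.1-1" for i in range(10)}
toy.update({f"ty{i}": ">30" for i in range(3)})
result = subsample_groups(toy, max_per_group=5)
print({label: len(ids) for label, ids in result.items()})
```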
Read coverage at promoter and junction regions was computed from BAM files with the SAMtools depth function (Li H, et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078-2079 (2009)). For the promoter region (−150 to 50 bp relative to TSS), the mean raw coverage across the region was divided by the mean coverage of the upstream (−2000 to −1000 bp relative to TSS) and downstream (1000 to 2000 bp relative to TSS) flanks, yielding the “relative coverage”. The relative coverage score of an NDR was computed as:
relative coverage score = [mean(CRC) − mean(healthy)] / s.d.(CRC)
where mean(CRC) and mean(healthy) are the means of the average relative coverages at NDRs across CRC plasma and healthy plasma samples respectively, and s.d.(CRC) is the standard deviation of the average relative coverages at NDRs across CRC samples. The variance in healthy samples could not be estimated due to low sequencing depth (˜5×). When computing the average relative coverage of each NDR (either −150 to 50 bp relative to TSS, or −300 to −100 bp relative to first exon end), positions with relative coverage >2 were truncated to reduce bias from potential outlier values.
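The coverage normalization can be illustrated with the following minimal sketch. Several points are assumptions of this example rather than statements of the disclosure: the score is taken to be the standardized difference [mean(CRC) − mean(healthy)] / s.d.(CRC), inferred from the definitions above; "truncated" is interpreted as capping per-position values at 2; the sample standard deviation is used; and the function names are hypothetical.

```python
from statistics import mean, stdev


def relative_coverage(region_cov, upstream_cov, downstream_cov, cap=2.0):
    """Average relative coverage of an NDR: per-position depth over mean flank depth.

    Assumption: values above `cap` are capped at `cap` before averaging.
    """
    flank_mean = mean(list(upstream_cov) + list(downstream_cov))
    if flank_mean == 0:
        raise ValueError("flanking regions have zero coverage")
    per_position = [min(depth / flank_mean, cap) for depth in region_cov]
    return mean(per_position)


def relative_coverage_score(crc_values, healthy_values):
    """Assumed form of the score: (mean(CRC) - mean(healthy)) / s.d.(CRC)."""
    return (mean(crc_values) - mean(healthy_values)) / stdev(crc_values)


# Toy depths: the NDR is covered at half the depth of its flanks.
print(relative_coverage([10, 10], [20, 20], [20, 20]))
# Negative score: lower relative coverage in CRC plasma than in healthy plasma.
print(relative_coverage_score([0.4, 0.6], [1.0]))
```

Under this sign convention, a negative score indicates stronger cfDNA degradation at the NDR in CRC plasma than in healthy plasma.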
To explore the association between relative coverage and a range of epigenetic features, linear regression was used to fit each candidate feature (covariate) with relative coverage (response). Whole blood gene expression (fpkm) was discretized into 6 bins [unexpressed, 0.01<fpkm≤0.1, 0.1<fpkm≤1, 1<fpkm≤5, 5<fpkm≤30, fpkm >30] and fitted as a categorical covariate with the unexpressed group as the reference group. Peaks of epigenetic features [DNase, H3K4me3, H3K36me3, H3K27ac, H3K4me1, H3K9me3 and H3K27me3] from primary T-cells (E034) were obtained from the Roadmap Epigenomics Project. Epigenetic features were fitted as binary covariates with no signal as the reference group.
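As a hedged example, the discretization of whole blood expression and the fit of a single categorical covariate can be sketched as below. With one categorical covariate and an intercept, the least-squares coefficient of each level equals the mean response of that level minus the mean response of the reference level, so no matrix algebra is needed here. Treating fpkm ≤0.01 as "unexpressed" is a simplifying assumption of this sketch (the disclosure defines unexpressed transcripts across GTEx samples), and the names `fpkm_bin` and `categorical_effects` are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import mean

# Right-closed bin edges taken from the text; labels are illustrative.
EDGES = [0.01, 0.1, 1, 5, 30]
LABELS = ["unexpressed", "0.01-0.1", "0.1-1", "1-5", "5-30", ">30"]


def fpkm_bin(fpkm):
    """Map an fpkm value to one of the six expression bins (upper edge inclusive)."""
    return LABELS[bisect_left(EDGES, fpkm)]


def categorical_effects(fpkms, coverages, reference="unexpressed"):
    """Coefficient of each bin relative to the reference bin (difference of group means)."""
    by_bin = defaultdict(list)
    for fpkm, cov in zip(fpkms, coverages):
        by_bin[fpkm_bin(fpkm)].append(cov)
    ref_mean = mean(by_bin[reference])  # raises if the reference bin is empty
    return {label: mean(vals) - ref_mean
            for label, vals in by_bin.items() if label != reference}


# Toy data: highly expressed transcripts show lower relative coverage.
print(categorical_effects([0.0, 0.005, 50, 60], [1.0, 1.2, 0.5, 0.7]))
```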
Estimation of ctDNA Fractions from Deep WGS cfDNA Data
The ctDNA fractions in CRC plasma samples were quantified using four different methods: THetA2, TitanCNA, AbsCN-seq and PurBayes (Oesper L, Satas G, Raphael B J. Quantifying tumor heterogeneity in whole-genome and whole-exome sequencing data. Bioinformatics 30, 3532-3540 (2014); Ha G, et al. TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data. Genome research 24, 1881-1893 (2014); Bao L, Pu M, Messer K. AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data. Bioinformatics 30, 1056-1063 (2014); Larson N B, Fridley B L. PurBayes: estimating tumor cellularity and subclonality in next-generation sequencing data. Bioinformatics 29, 1888-1889 (2013)). These methods were originally developed to use matched tumor tissue and germline Exome/WGS data to estimate mutation and copy number tumor heterogeneity, including tumor purity. Here, these methods were applied to the ˜90× cfDNA and ˜30× matched germline (buffy coat) WGS data to estimate ctDNA fractions. Somatic mutations and copy number alterations, as input to AbsCN-seq and PurBayes, were called by SMuRF (Huang W, Guo Y A, Muthukumar K, Baruah P, Chang M M, Skanderup A J. SMuRF: Portable and accurate ensemble prediction of somatic mutations. Bioinformatics (Oxford, England), (2019)) and CNVkit (Talevich E, Shain A H, Botton T, Bastian B C. CNVkit: genome-wide copy number detection and visualization from targeted DNA sequencing. PLoS Comput Biol 12, e1004873 (2016)), respectively, using the bcbio-nextgen workflow (Guimera R V. bcbio-nextgen: Automated, distributed next-gen sequencing pipeline. EMBnet.journal 17, 30 (2012)). The median of these four ctDNA fraction estimates for a given sample was used as the final consensus estimate of the ctDNA fraction.
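For illustration, the consensus step may be sketched as follows. The per-method values shown are invented for this example and are not results from the disclosure; the function name `consensus_fraction` is hypothetical.

```python
from statistics import median


def consensus_fraction(estimates):
    """Median of the per-method ctDNA fraction estimates for one sample.

    estimates: dict mapping method name -> estimated ctDNA fraction.
    """
    return median(estimates.values())


# Illustrative (made-up) estimates for a single plasma sample.
example = {"THetA2": 0.30, "TitanCNA": 0.34, "AbsCN-seq": 0.28, "PurBayes": 0.40}
print(consensus_fraction(example))
```

With four estimates, the median is the average of the two middle values, which limits the influence of a single outlying method.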
Since germline samples were not available for the BRCA patients, the ctDNA fractions of the BRCA plasma samples were estimated by ichorCNA (Adalsteinsson V A, et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications 8, 1324 (2017)).
The cancer cfDNA samples were in silico diluted by mixing cancer cfDNA reads with reads from healthy samples, maintaining the same average coverage as the original undiluted cancer cfDNA sample. The in silico generated samples were diluted from ctDNA content ranging from 0.005 up to the original undiluted fractions, with a denser sampling of low fractions 0.05 (Table S2). The inventors generated a training set of 231 samples originating from 8 samples from 5 CRC patients, and a test set of 113 samples originating from 4 samples from 2 additional CRC patients. For BRCA, the training set comprised 215 in silico generated samples from 7 patients/samples, and the test set had 93 samples from 3 patients/samples (Table S2).
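The read mixing proportions for an in silico dilution can be sketched as below, by way of example only. The sketch assumes that reads from healthy samples contain no ctDNA and that the ctDNA fraction of the mixture scales linearly with the proportion of reads drawn from the cancer sample; the function name `dilution_read_counts` is hypothetical.

```python
def dilution_read_counts(total_reads, original_fraction, target_fraction):
    """Number of cancer-sample and healthy-sample reads for a diluted sample.

    The total read count is kept equal to that of the undiluted sample so that the
    average coverage is maintained. Assumes healthy reads carry no tumor DNA.
    """
    if not 0 < target_fraction <= original_fraction <= 1:
        raise ValueError("require 0 < target_fraction <= original_fraction <= 1")
    cancer_reads = round(total_reads * target_fraction / original_fraction)
    healthy_reads = total_reads - cancer_reads
    return cancer_reads, healthy_reads


# Diluting a sample with ctDNA fraction 0.4 down to 0.1 keeps a quarter of its reads.
print(dilution_read_counts(1_000_000, 0.4, 0.1))
```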
The relative coverage score (see above) of NDRs was computed for all transcripts and combined with expression data to shortlist tumor/blood-specific transcripts associated with differential tumor/blood NDR cfDNA coverage. For each transcript, the inventors calculated its median fpkm (fpkmblood) across all whole blood samples, its median fpkm (fpkmCRC) across all CRC samples, as well as its respective median fpkm values for other tumor types. Tumor transcripts were defined as being highly expressed in CRC tumors, lowly expressed in normal blood cells, and more highly degraded in CRC samples at both promoter and junction NDRs (fpkmCRC>10, fpkmblood<1, relative coverage score <−0.2). Blood transcripts were defined with similar rules (fpkmCRC<1, fpkmblood>10, relative coverage score >0.2). This approach shortlisted 284 CRC and 210 blood transcripts, each transcript with two features (promoter and junction NDR coverage). After removing overlapping features (multiple transcripts sharing the same NDR), NDR coverages of the resulting 529 tumor and 379 blood features (total n=908) were used as input features for predictive modelling. For the CRC+BRCA model, only transcripts with blood-specific expression (fpkmblood>5) that were also lowly expressed (fpkm <1) in tumors of all 20 tumor types were shortlisted (TCGA tumor type acronyms: BLCA, BRCA, CESC, CRC, ESCA, GBM, HNSC, KIRC, KIRP, LGG, LIHC, LUAD, LUSC, OV, PAAD, PRAD, SKCM, STAD, THCA, UCEC), leading to a total of 792 features.
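The shortlisting thresholds for the CRC-specific model can be illustrated with the sketch below. The function name `classify_transcript` is hypothetical, and requiring the score threshold at both the promoter and the junction NDR for blood transcripts is an assumption based on the statement that they were defined with similar rules.

```python
def classify_transcript(fpkm_crc, fpkm_blood, promoter_score, junction_score):
    """Label a transcript as 'tumor', 'blood', or None using the stated thresholds."""
    if (fpkm_crc > 10 and fpkm_blood < 1
            and promoter_score < -0.2 and junction_score < -0.2):
        return "tumor"  # expressed in CRC, silent in blood, more degraded in CRC plasma
    if (fpkm_crc < 1 and fpkm_blood > 10
            and promoter_score > 0.2 and junction_score > 0.2):
        return "blood"  # expressed in blood, silent in CRC (assumed mirror-image rule)
    return None


print(classify_transcript(25.0, 0.2, -0.5, -0.3))  # tumor
print(classify_transcript(0.3, 40.0, 0.6, 0.4))    # blood
print(classify_transcript(25.0, 0.2, -0.5, 0.1))   # None: junction NDR fails the cut-off
```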
Lasso Regularized Regression to Predict ctDNA Fraction
Lasso regularized linear regression using glmnet (Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software 33, 1 (2010)) was used to select features and predict ctDNA content in plasma cfDNA samples. To select robust features, half of the training data was first extracted randomly and Lasso with ten-fold cross-validation was used to identify features predictive of ctDNA fractions. This procedure was repeated 1000 times and the top stable features (selection frequency 0.99) were extracted as the final predictive features, which resulted in 6 predictive features (Table 1) for the CRC-specific model and 10 predictive features (Table 2) for the CRC+BRCA model, respectively. The inventors trained the final predictive model with ten-fold cross-validation on the full training set. The inventors also attempted to predict ctDNA fractions with log-transformed relative coverage, and tested the performance using a logistic regression model, both of which failed to outperform the current model in prediction accuracy (data not shown).
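The repeated-subsampling procedure for selecting stable features can be sketched as follows. This is a skeleton only: the Lasso fit with ten-fold cross-validation is represented by a caller-supplied function (`select_features`, hypothetical) that returns the features with non-zero coefficients for a given subset of training samples, and no glmnet call is reproduced here. Treating the selection frequency cut-off as "at least 0.99" is also an assumption of this example.

```python
import random
from collections import Counter


def stability_selection(n_samples, select_features, n_repeats=1000, seed=0):
    """Selection frequency of each feature over repeated random half-samples.

    select_features(indices) must return the features chosen by the model fitted
    on the training samples at `indices` (e.g. non-zero Lasso coefficients).
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_repeats):
        half = rng.sample(range(n_samples), n_samples // 2)  # random half of the training set
        counts.update(set(select_features(half)))
    return {feature: count / n_repeats for feature, count in counts.items()}


def stable_features(frequencies, threshold=0.99):
    """Features whose selection frequency reaches the threshold (assumed inclusive)."""
    return sorted(f for f, freq in frequencies.items() if freq >= threshold)


# Dummy selector standing in for the Lasso fit: "A" is always selected,
# "B" only when training sample 0 happens to be in the half-sample.
def dummy_selector(indices):
    return {"A", "B"} if 0 in indices else {"A"}


freqs = stability_selection(10, dummy_selector, n_repeats=200)
print(stable_features(freqs))
```

Features that are selected only for particular subsets of the training data fall below the threshold and are discarded, which is the purpose of the repetition.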
To evaluate the robustness of the model when it was trained and tested on in silico samples generated using independent healthy samples, the normal samples were split evenly into 2 sets. The first set (N1) was used to perform in silico spike-ins/dilution of the training set, and the second set (N2) was used for in silico dilution of the test set. Briefly, the coefficients of the CRC model (comprising the 6 features in Table 1) were re-fitted using the training data (diluted with the N1 healthy samples), and the model accuracy on the withheld test samples (diluted with N2) was then evaluated. This procedure was repeated 10 times and the model accuracy on the test data generated using the independent normal samples was evaluated.
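A minimal sketch of the healthy-sample split is given below. The random assignment, the use of a different seed for each of the 10 repetitions, and the function name `split_healthy` are assumptions of this example; with an odd number of samples, the second set receives the extra sample.

```python
import random


def split_healthy(sample_ids, seed=0):
    """Randomly split healthy samples into two disjoint, near-equal sets (N1, N2)."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


# Ten repetitions, each with an independent split of 29 healthy samples.
healthy = [f"healthy_{i}" for i in range(29)]
for repeat in range(10):
    n1, n2 = split_healthy(healthy, seed=repeat)
    # n1 would be used to dilute the training samples, n2 to dilute the test samples.
    assert not set(n1) & set(n2)
print(len(n1), len(n2))
```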
ichorCNA Benchmarking
For the in silico samples, cfDNA reads from the 12 deep-WGS CRC samples were mixed with reads from healthy samples to generate in silico low-pass samples (˜0.1×) for ctDNA content estimation using ichorCNA. The usage guidelines with default parameters were followed in the two-step workflow: 1) read count coverage calculation with the HMMcopy Suite, and 2) tumor content estimation with the ichorCNA R package.
cfDNA sequencing data have been deposited at the European Genome-phenome Archive (EGA) under the accession code EGAS00001004657. The data is made available for academic research. Data will be released subject to a data transfer agreement.
Table S1. ctDNA burden estimation of plasma samples from cancer patients.
Table S2. The in silico samples of various ctDNA content.
Table S3. Information on all candidate features of nucleosome-depleted regions for colorectal cancer.
Table S4. Coefficients for the selected NDRs in the trained models.
Table S5. Observed ctDNA fractions in the LOD analysis for the CRC model.
Table S6. CRC plasma samples for lp-WGS and targeted sequencing.
Table S7. A panel of 100 genes frequently mutated in colorectal and breast cancer.
Table S8. Variant allele frequency estimation of CRC plasma samples.
Table S9. Mutations missed by the callers for the CRC patients with serial plasma samples.
Table S10. Information on all candidate pan-cancer features of nucleosome-depleted regions.
Table S11. Observed ctDNA fractions in the LOD analysis for the CRC+BRCA model.
Table S12. A panel of 77 genes for screening breast cancer samples.
Table S13. Transcript expression data.
Table S14. Information on all predictive features for colorectal cancer.
Table S15. Information on predictive features for colorectal cancer.
Table S16. Information on additional predictive pan-cancer features.
Table S17. Information on predictive pan-cancer features.
Table S18. All predictive feature combinations for CRC using in silico samples generated with random subsets of healthy samples.
Table S19. Information on predictive features for CRC using in silico samples generated with random subsets of healthy samples.
Table S20. All predictive pan-cancer feature combinations using in silico samples generated with random subsets of healthy samples.
Table S21. Information on predictive pan-cancer features using in silico samples generated with random subsets of healthy samples.
34 8 135 278 428 621 80 7 8 78
2 137 236 258 288 304 338 388 674 778
9 242 278 428 61 621 680 742 768 789
34 89 135 238 242 428 742 788 789
242 311 428 454 634 6 0 742 768 789
52 4 89 236 258 288 338 376 435
9 135 137 20 311 428 4 4 680 7 8 789
4 137 284 304 621 634 45 6 3 772 778
9 20 278 311 428 454 34 680 788 789
34 97 137 201 236 242 735 772 778
37 52 64 135 242 304 621 742 772
37 4 201 205 206 259 3 7 630 735
34 37 64 201 20 634 653 738 772
152 206 306 376 435 544 607 708 766
52 89 2 8 259 288 304 338 456 30
2 152 30 402 450 456 544 08 45
52 64 278 367 547 634 653 735 772
137 205 259 30 4 4 621 63 80 735
37 52 84 258 311 314 338 454 6
52 89 126 258 28 30 338 454 630
201 288 331 3 7 392 547 674 7 778
7 137 205 311 37 428 454 680 743 768
37 52 64 201 242 27 304 772 77
35 135 242 278 428 621 680 7 8 772
4 201 258 278 338 415 453 607 630 634
Gaps within the entries above indicate data missing or illegible when filed.
Profiling of ctDNA may offer a non-invasive approach to estimate disease burden and monitor disease progression. Embodiments of the method described herein provide a quantitative method, which exploits local tissue-specific and gene-specific cfDNA degradation patterns, that can accurately estimate ctDNA burden independent of genomic aberrations.
Nucleosome-dependent cfDNA degradation at selected NDRs (e.g. promoters and first exon-intron junctions) is shown herein to be strongly associated with differential transcriptional activity in tumors and blood. A machine learning model that was developed based on expression-specific DNA degradation patterns was found to be capable of accurately predicting ctDNA fractions (see examples). Leveraging these findings, embodiments of the methods enable for the first time the detection of tumor DNA burden (even of very low frequency) in blood by sequencing only selected NDRs in cfDNA assays. From less than 50 kb of DNA sequence in total (4 kb×6 features or 4 kb×10 features), embodiments of the methods can accurately predict ctDNA levels, and thereby monitor the dynamics of the systemic tumor burden over time from blood/liquid samples. Indeed, using compact targeted sequencing (<25 kb) of predictive regions, the disclosure demonstrates how embodiments of the method enable quantitative low-cost tracking of ctDNA dynamics and disease progression.
Embodiments of the method enjoy several advantages including cost efficiency, flexibility, high accuracy and high sensitivity.
Embodiments of the method require less sequencing and are therefore cost-efficient. In embodiments of the method, about 100× less DNA sequencing (e.g. ˜30 kb at 100× coverage) is needed than for low-pass WGS-based methods, which require whole genome sequencing at ˜0.1×. The sequencing cost is also comparable to sequencing a panel at 10,000× (the usual target for coding mutation panels). Embodiments of the method also require less sequencing than standard targeted sequencing assays, which usually require more than 1000 kb of DNA sequence.
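The sequencing comparison can be checked with simple arithmetic, sketched below. The human genome size of roughly 3.1 Gb is an assumption of this example and is not stated in the disclosure; the function name `sequenced_bases` is hypothetical.

```python
GENOME_SIZE_BP = 3.1e9  # assumed approximate human genome size


def sequenced_bases(region_bp, coverage):
    """Total bases sequenced for a region of `region_bp` at the given mean coverage."""
    return region_bp * coverage


targeted = sequenced_bases(30_000, 100)         # ~30 kb of NDRs at 100x coverage
low_pass = sequenced_bases(GENOME_SIZE_BP, 0.1)  # whole genome at ~0.1x coverage
fold_reduction = low_pass / targeted

print(f"targeted NDR assay: {targeted:.0f} bases")
print(f"low-pass WGS: {low_pass:.0f} bases")
print(f"fold reduction: {fold_reduction:.0f}x")
```

Under these assumptions the targeted assay sequences about 3 Mb in total against roughly 310 Mb for low-pass WGS, consistent with the approximately 100-fold reduction stated above.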
Further, embodiments of the method can be implemented as an extension/add-on to a standard targeted panel assay, providing flexibility and further allowing for an extremely cost-effective approach to generic ctDNA profiling. For example, the NDRs identified herein can be easily added to existing cfDNA capture panels, eliminating the need to perform two separate assays. Notably, WGS or methylation-based assays do not enjoy this flexibility.
Last but not least, embodiments of the method are capable of accurately estimating cancer cell-free DNA burden with a mean deviation of about 3.4%. Compared with conventional coding panels, which usually fail (no mutations detected) in more than 20-30% of patients, embodiments of the method are shown to be able to accurately predict cancer cfDNA in most cancer patients. As demonstrated in the examples, both colorectal cancer and pan-cancer models have high prediction accuracy, with the pan-cancer model generalizing well to most/all solid tumor types.
Overall, embodiments of the method enable quantitative low-cost tracking of ctDNA dynamics and disease progression, and would be invaluable in the clinical setting.
It will be appreciated by a person skilled in the art that other variations and/or modifications may be made to the embodiments disclosed herein without departing from the spirit or scope of the disclosure as broadly described. For example, in the description herein, features of different exemplary embodiments may be mixed, combined, interchanged, incorporated, adopted, modified, included or the like across different exemplary embodiments. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.
Number | Date | Country | Kind |
---|---|---|---|
10201912600T | Dec 2019 | SG | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/SG2020/050766 | 12/18/2020 | WO |