The present disclosure relates in general to methods for detecting and diagnosing cancer, in particular lung cancer, at early stages of the disease.
Lung cancer is the most lethal cancer in the world1. The 5-year survival rate is less than 20%2 largely due to the late stage at diagnosis where treatments are less effective than at earlier stages, and the incidence of lung cancer continues to increase worldwide3. Although large randomized trials have demonstrated that lung cancer screening using chest low dose computed tomography (LDCT) decreases mortality in high risk individuals4,5, LDCT remains underutilized, with less than 6% of at-risk individuals screened, due to concerns of potential harm from false positive imaging results, radiation exposure, and morbidity from invasive diagnostic procedures6-8.
We now provide new non-invasive methods for detecting and diagnosing cancer, in particular lung cancer, at early stages of the disease. Accordingly, in certain embodiments, a method of diagnosing cancer in a subject comprises: extracting cell-free DNA (cfDNA) from the subject's biological sample; generating genomic libraries from the extracted cfDNA and performing whole genome sequencing of cfDNA fragments; mapping the cfDNA fragments to a genomic origin, evaluating fragment length, and obtaining genome-wide fragmentation profiles for each sample; identifying protein biomarkers of the subject; and comparing the subject's cfDNA fragmentation profile and protein biomarkers with those of normal reference non-cancer subjects. In certain embodiments, the cancer is lung cancer.
In certain embodiments, the method further comprises subjecting the subject to low dose helical computed tomography (LDCT). In certain embodiments, the method further comprises comparing clinical data between the subject diagnosed as having lung cancer and non-cancer subjects. In certain embodiments, the cfDNA fragment mean length and profiles are similar among non-cancer individuals. In certain embodiments, the cfDNA fragment profiles of cancer subjects vary. In certain embodiments, the serum levels of one or more tumor antigens, cytokines or proteins are measured.
In certain embodiments, the one or more tumor antigens comprise: carcinoembryonic antigen (CEA), CA19-9, CA 125, tissue polypeptide antigen (TSA), CYFRA-21-1, neuron-specific enolase, progastrin-releasing peptide (ProGRP), plasma kallikrein B1 (KLKB1), serum amyloid A, haptoglobin-alpha-2, ADAM-17, osteoprotegerin, pentraxin 3, follistatin, tumor necrosis factor receptor superfamily member 1A or combinations thereof.
In certain embodiments, the one or more proteins comprise C-reactive protein (CRP), Chitinase-3-like protein 1 (YKL-40/CHI3L1) or fragments thereof.
In certain embodiments, a DELFI (DNA evaluation of fragments for early interception) score is generated, wherein a principal component analysis is incorporated into a machine learning predictive model to generate a score for each subject as an average over cross-validation repeats (DELFI score(s)). As an example, due to the high dimensionality of the fragmentation features relative to the number of available samples for training, a principal component analysis was performed within each training set to reduce the dimensionality of the feature space, retaining the minimum number of principal components needed to explain 90% of the variance of the fragmentation profiles between samples. In addition to the principal component features, all 39 z-scores were evaluated in a logistic regression model with a LASSO penalty. The optimized LASSO penalty in our analysis was obtained by resampling using the caret R package. The DELFI score derived for each sample corresponds to the mean score across the 10 cross-validation repeats. References herein to “DELFI score” are to values determined by the procedure specified above.
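By way of non-limiting illustration only, two steps of the above procedure can be sketched in Python: selecting the minimum number of principal components that explain 90% of the variance, and averaging a sample's score over cross-validation repeats. The function names and the example numbers are hypothetical, the eigenvalues are assumed to have been computed already within a training set, and the sketch does not reproduce the LASSO-penalized logistic regression or the caret-based tuning described above.

```python
from statistics import mean


def min_components_for_variance(eigenvalues, threshold=0.90):
    """Return the smallest number of leading principal components whose
    cumulative share of total variance reaches `threshold`.

    `eigenvalues` are the PCA variances of one training set (assumed input).
    """
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    # Components are taken in order of decreasing explained variance.
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


def mean_score_over_repeats(repeat_scores):
    """Average one sample's model scores across cross-validation repeats."""
    if not repeat_scores:
        raise ValueError("at least one repeat score is required")
    return mean(repeat_scores)


# Hypothetical values for illustration only.
n_components = min_components_for_variance([8.0, 1.5, 0.3, 0.2])
score = mean_score_over_repeats([0.31, 0.35, 0.33, 0.36, 0.34,
                                 0.32, 0.35, 0.37, 0.33, 0.34])
```

In this sketch the component count is chosen per training set, consistent with the statement that the principal component analysis was performed within each training set.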
In certain embodiments, the DELFI scores for non-cancer individuals are less than about 0.3. In certain embodiments, the DELFI scores for stage I cancer are between about 0.3 to less than 0.5. In certain embodiments, the DELFI scores for stage II cancer are between about 0.5 to less than 0.8. In certain embodiments, the DELFI scores for stage III cancer are between about 0.8 to less than 0.99. In certain embodiments, the DELFI scores for stage IV cancer are about 0.99 or greater. In certain embodiments, the DELFI score for stage I cancer is about 0.35. In certain embodiments, the DELFI score for stage II cancer is about 0.75. In certain embodiments, the DELFI score for stage III cancer is about 0.9. In certain embodiments, the DELFI score for stage IV cancer is about 0.99.
In certain embodiments, a method of diagnostically distinguishing subjects with small cell lung cancer (SCLC) from those with non-small cell lung cancer (NSCLC) or without cancer comprises: comparing differential expression of transcription factors in biological samples of SCLC, NSCLC or white blood cells; selecting at least one or more transcription factors having a higher differential expression as compared to the expression of transcription factors identified in the biological samples; extracting cell-free DNA (cfDNA) from the subject's biological sample; obtaining genome-wide fragmentation profiles of the cfDNA obtained from the subject to identify the at least one or more transcription factor binding sites; and evaluating cfDNA coverage of the at least one or more transcription factor binding sites to determine fragment coverage and size as compared to non-cancer subjects or NSCLC subjects; thereby diagnostically distinguishing subjects with small cell lung cancer (SCLC) from those with non-small cell lung cancer (NSCLC) or without cancer.
In certain embodiments, the at least one transcription factor is Achaete-Scute Family basic helix-loop-helix Transcription Factor 1 (ASCL1).
In certain embodiments, the cfDNA fragment sizes in nucleic acid sequences comprising ASCL1 binding sites are larger in SCLC patients as compared to patients with NSCLC or non-cancer subjects.
In certain embodiments, the aggregate fragment coverage in nucleic acid sequences comprising ASCL1 binding sites is decreased in SCLC patients as compared to patients with NSCLC or non-cancer subjects.
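As a non-limiting sketch of how fragment coverage and fragment size at transcription factor binding sites might be summarized, the following Python example counts the fragments overlapping a set of sites and computes their mean length. The coordinates are hypothetical placeholders rather than actual ASCL1 binding sites, intervals are assumed to be half-open (chromosome, start, end) tuples, and the comparison against NSCLC or non-cancer reference values is left to the caller.

```python
from statistics import mean


def overlaps(fragment, site):
    """True if two half-open (chrom, start, end) intervals overlap."""
    return (fragment[0] == site[0]
            and fragment[1] < site[2]
            and site[1] < fragment[2])


def binding_site_summary(fragments, sites):
    """Return (aggregate coverage, mean fragment length) over the sites.

    Aggregate coverage is taken here to be the number of fragments that
    overlap at least one site (an assumption made for this illustration).
    """
    hits = [f for f in fragments if any(overlaps(f, s) for s in sites)]
    mean_length = mean(f[2] - f[1] for f in hits) if hits else None
    return len(hits), mean_length


# Hypothetical coordinates for illustration only.
example_sites = [("chr1", 1000, 1100)]
example_fragments = [
    ("chr1", 950, 1120),   # overlaps the site, 170 bp
    ("chr1", 1050, 1240),  # overlaps the site, 190 bp
    ("chr1", 2000, 2160),  # same chromosome, no overlap
    ("chr2", 1000, 1100),  # different chromosome
]
coverage, mean_length = binding_site_summary(example_fragments, example_sites)
```

The same summary computed for reference non-cancer or NSCLC samples would provide the baseline against which the larger fragment sizes and decreased coverage described above are assessed.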
In certain embodiments, a method of diagnostically distinguishing subjects with small cell lung cancer (SCLC) from those with non-small cell lung cancer (NSCLC) or without cancer comprises: extracting cell-free DNA (cfDNA) from the subject's biological sample; and evaluating cfDNA coverage of Achaete-Scute Family basic helix-loop-helix Transcription Factor 1 (ASCL1) binding sites to determine fragment coverage and size as compared to non-cancer subjects or NSCLC subjects; thereby diagnostically distinguishing subjects with small cell lung cancer (SCLC) from those with non-small cell lung cancer (NSCLC) or without cancer.
In certain embodiments, the cfDNA fragment sizes in nucleic acid sequences comprising ASCL1 binding sites are larger in SCLC patients as compared to patients with NSCLC or non-cancer subjects.
In certain embodiments, the aggregate fragment coverage in nucleic acid sequences comprising ASCL1 binding sites is decreased in SCLC patients as compared to patients with NSCLC or non-cancer subjects.
In certain embodiments, the subject's fragmentation profile provides cell type-specific genome-wide transcription factor binding and is diagnostic of the type of lung cancer and histological subtypes.
In certain embodiments, the subject is administered cancer therapies.
In certain embodiments, the method of determining recurrence of cancer in a subject comprises the methods embodied herein.
In certain embodiments, a method of correcting GC content in genome-wide fragmentation analyses comprises: sequencing whole genome libraries of cancer subjects and cancer-free subjects from samples not subjected to polymerase chain reaction (PCR) and samples subjected to a variable number of PCR cycles; filtering adapter sequences; aligning sequence reads against a human reference genome and removing duplicate reads; converting each aligned pair to a genomic interval, wherein the genomic interval represents a sequenced DNA fragment; and selecting reads having a MAPQ score of at least 30.
In certain embodiments, the method further comprises tiling the reference genome into about 100-600 non-overlapping 1-10 Mb bins spanning about 1-3 Gb of the genome, thereby capturing large-scale epigenetic differences in fragmentation across the genome from low-coverage whole genome sequencing.
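For illustration only, the conversion of aligned read pairs to genomic intervals with duplicate removal and mapping-quality selection can be sketched as follows. The AlignedPair record and its fields are hypothetical stand-ins for the output of an aligner, duplicates are identified here simply by identical coordinates, and the single mapq field is assumed to summarize both mates; an actual pipeline would rely on the duplicate flags and per-read qualities reported by the alignment software.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class AlignedPair:
    """Hypothetical record for one properly aligned read pair."""
    chrom: str
    start: int  # leftmost aligned base of the pair, 0-based
    end: int    # one past the rightmost aligned base of the pair
    mapq: int   # mapping quality assumed to summarize both mates


def pairs_to_fragments(pairs: Iterable[AlignedPair],
                       min_mapq: int = 30) -> List[Tuple[str, int, int]]:
    """Convert aligned pairs to (chrom, start, end) fragment intervals,
    dropping low-quality pairs and coordinate-identical duplicates."""
    seen = set()
    fragments = []
    for pair in pairs:
        if pair.mapq < min_mapq:
            continue  # keep only pairs with MAPQ of at least min_mapq
        interval = (pair.chrom, pair.start, pair.end)
        if interval in seen:
            continue  # simplified duplicate removal (assumption)
        seen.add(interval)
        fragments.append(interval)
    return fragments


# Hypothetical input for illustration only.
example_pairs = [
    AlignedPair("chr1", 100, 267, 60),
    AlignedPair("chr1", 100, 267, 60),  # duplicate of the first pair
    AlignedPair("chr1", 500, 650, 12),  # below the MAPQ threshold
    AlignedPair("chr2", 40, 210, 30),
]
example_fragments = pairs_to_fragments(example_pairs)
```

Each returned interval represents one sequenced DNA fragment, and its length is the difference between its end and start coordinates.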
In certain embodiments, the ratios of the number of short fragments to the number of long (151 to 220 bp) fragments across the 100-600 non-overlapping 1-10 Mb bins are normalized for GC content and library size.
In certain embodiments, the method further comprises obtaining the total number of fragments within each GC stratum, which comprises assigning each fragment to one of about 100 possible GC strata between 0 and 1.
In certain embodiments, a GC stratum of 1 indicates a fragment composed entirely of G and C nucleotides.
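A minimal, non-limiting sketch of computing per-bin short-to-long fragment ratios is given below. It assumes a 5 Mb bin size (within the 1-10 Mb range stated above), assigns each fragment to a bin by its start coordinate, and assumes that short fragments are those of 100 to 150 bp; only the 151 to 220 bp range for long fragments is stated in the text. Normalization for GC content and library size is not shown.

```python
from collections import defaultdict

BIN_SIZE = 5_000_000    # assumed bin width, within the stated 1-10 Mb range
SHORT_RANGE = (100, 150)  # assumption: short-fragment range is not stated above
LONG_RANGE = (151, 220)   # long-fragment range as stated in the text


def short_long_ratios(fragments, bin_size=BIN_SIZE):
    """Return {(chrom, bin_index): short/long ratio} for half-open
    (chrom, start, end) fragments; bins without long fragments are skipped."""
    counts = defaultdict(lambda: [0, 0])  # [short count, long count]
    for chrom, start, end in fragments:
        length = end - start
        key = (chrom, start // bin_size)  # bin chosen by fragment start
        if SHORT_RANGE[0] <= length <= SHORT_RANGE[1]:
            counts[key][0] += 1
        elif LONG_RANGE[0] <= length <= LONG_RANGE[1]:
            counts[key][1] += 1
    return {key: short / long
            for key, (short, long) in counts.items() if long > 0}


# Hypothetical fragments for illustration only.
example = [
    ("chr1", 0, 120),                  # short, bin 0
    ("chr1", 1_000, 1_130),            # short, bin 0
    ("chr1", 2_000, 2_180),            # long, bin 0
    ("chr1", 10, 60),                  # outside both ranges, ignored
    ("chr1", 5_000_100, 5_000_300),    # long, bin 1
    ("chr1", 5_000_500, 5_000_640),    # short, bin 1
]
ratios = short_long_ratios(example)
```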
In certain embodiments, the method further comprises obtaining a distribution of fragment counts by GC stratum for non-cancer samples and the median of target distributions.
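As a non-limiting illustration, the assignment of fragments to GC strata and the derivation of a target distribution from non-cancer samples can be sketched as follows. The sketch assumes 100 equal-width strata, places a fragment consisting entirely of G and C nucleotides in the top stratum, and interprets the target distribution as the per-stratum median of counts across non-cancer samples; these choices and all function names are assumptions made for illustration.

```python
from collections import Counter
from statistics import median

N_STRATA = 100  # "about 100" strata between GC fractions 0 and 1


def gc_stratum(sequence, n_strata=N_STRATA):
    """Map a fragment sequence to a stratum index in 0..n_strata-1."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty fragment sequence")
    gc_fraction = sum(1 for base in seq if base in "GC") / len(seq)
    # A GC fraction of exactly 1 is placed in the top stratum.
    return min(int(gc_fraction * n_strata), n_strata - 1)


def stratum_counts(sequences, n_strata=N_STRATA):
    """Count the fragments of one sample in each GC stratum."""
    counts = Counter(gc_stratum(s, n_strata) for s in sequences)
    return [counts.get(i, 0) for i in range(n_strata)]


def target_distribution(reference_counts):
    """Per-stratum median across non-cancer samples (assumed interpretation).

    `reference_counts` holds one list of stratum counts per sample.
    """
    return [median(column) for column in zip(*reference_counts)]


# Hypothetical sequences, using 4 strata to keep the example small.
sample_a = stratum_counts(["ATAT", "ATGC", "GGCC"], n_strata=4)
sample_b = stratum_counts(["ATGC", "ATGC", "GGCC"], n_strata=4)
sample_c = stratum_counts(["ATAT", "GGCC", "GCGC"], n_strata=4)
target = target_distribution([sample_a, sample_b, sample_c])
```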
In certain embodiments, normalizing sample-to-sample PCR biases in subjects comprises assigning a weight w_i to the fragments in GC stratum i such that Σ_{i=1}^{N}
In certain embodiments, the fragmentation profiles are consistent among non-cancer subjects and subjects with non-malignant lung cancer.
In certain embodiments, the cancer subjects display widespread genome-wide variation.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, to the extent that the terms “including”, “includes”, “having”, “has”, “with”, or variants thereof are used in either the detailed description and/or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.”
The term “about” or “approximately” means within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, i.e., the limitations of the measurement system. For example, “about” can mean within 1 or more than 1 standard deviation, per the practice in the art. Alternatively, “about” can mean a range of up to 20%, up to 10%, up to 5%, or up to 1% of a given value or range. Alternatively, particularly with respect to biological systems or processes, the term can mean within an order of magnitude within 5-fold, and also within 2-fold, of a value. Where particular values are described in the application and claims, unless otherwise stated the term “about” meaning within an acceptable error range for the particular value should be assumed.
The terms “aligned”, “alignment”, “mapped”, “aligning” and “mapping” refer to one or more sequences that are identified as a match in terms of the order of their nucleic acid molecules to a known sequence from a reference genome. Such alignment can be done manually or by a computer algorithm, examples including the Efficient Local Alignment of Nucleotide Data (ELAND) computer program distributed as part of the Illumina Genomics Analysis pipeline. The matching of a sequence read in aligning can be a 100% sequence match or less than 100% (non-perfect match).
The term “alternative allele” or “ALT” refers to an allele having one or more mutations relative to a reference allele, e.g., corresponding to a known gene.
The term “biomarker” means a distinctive biological or biologically derived indicator of a process, event or condition. Biomarkers can be used in methods of diagnosis, e.g. clinical screening, and prognosis assessment; and in monitoring the results of therapy, for identifying patients most likely to respond to a particular therapeutic treatment, as well as in drug screening and development. Biomarkers and uses thereof are valuable for identification of new drug treatments and for discovery of new targets for drug treatment. As used herein, the term “biomarker” refers to a molecule that is associated either quantitatively or qualitatively with a biological change. Examples of biomarkers include polypeptides, proteins or fragments of a polypeptide or protein; and polynucleotides, such as a gene product, RNA or RNA fragment; and other body metabolites. In certain embodiments, a “biomarker” means a compound that is differentially present (i.e., increased or decreased) in a biological sample from a subject or a group of subjects having a first phenotype (e.g., having a disease or condition) as compared to a biological sample from a subject or group of subjects having a second phenotype (e.g., not having the disease or condition or having a less severe version of the disease or condition). 
A biomarker may be differentially present at any level, but is generally present at a level that is increased by at least 5%, by at least 10%, by at least 15%, by at least 20%, by at least 25%, by at least 30%, by at least 35%, by at least 40%, by at least 45%, by at least 50%, by at least 55%, by at least 60%, by at least 65%, by at least 70%, by at least 75%, by at least 80%, by at least 85%, by at least 90%, by at least 95%, by at least 100%, by at least 110%, by at least 120%, by at least 130%, by at least 140%, by at least 150%, or more; or is generally present at a level that is decreased by at least 5%, by at least 10%, by at least 15%, by at least 20%, by at least 25%, by at least 30%, by at least 35%, by at least 40%, by at least 45%, by at least 50%, by at least 55%, by at least 60%, by at least 65%, by at least 70%, by at least 75%, by at least 80%, by at least 85%, by at least 90%, by at least 95%, or by 100% (i.e., absent). A biomarker is preferably differentially present at a level that is statistically significant (e.g., a p-value less than 0.05 and/or a q-value of less than 0.10 as determined using, for example, either Welch's T-test or Wilcoxon's rank-sum Test).
In addition, the term “biomarker” also includes the isoforms and/or post-translationally modified forms of any of the foregoing. The present invention contemplates the detection, measurement, quantification, determination and the like of both unmodified and modified (e.g., glycosylation, citrullination, phosphorylation, oxidation or other post-translational modification) proteins/polypeptides/peptides. In certain embodiments, it is understood that reference to the detection, measurement, determination, and the like, of a biomarker refers to detection of the protein/polypeptide/peptide (modified and/or unmodified).
The term “cancer” as used herein means a disease, condition, trait, genotype or phenotype characterized by unregulated cell growth or replication as is known in the art; including lung cancer (including non-small cell lung carcinoma), gastric cancer, colorectal cancer, as well as, for example, leukemias, e.g., acute myelogenous leukemia (AML), chronic myelogenous leukemia (CML), acute lymphocytic leukemia (ALL), and chronic lymphocytic leukemia, AIDS related cancers such as Kaposi's sarcoma; breast cancers; bone cancers such as Osteosarcoma, Chondrosarcomas, Ewing's sarcoma, Fibrosarcomas, Giant cell tumors, Adamantinomas, and Chordomas; Brain cancers such as Meningiomas, Glioblastomas, Lower-Grade Astrocytomas, Oligodendrocytomas, Pituitary Tumors, Schwannomas, and Metastatic brain cancers; cancers of the head and neck including various lymphomas such as mantle cell lymphoma, non-Hodgkin's lymphoma, adenoma, squamous cell carcinoma, laryngeal carcinoma, gallbladder and bile duct cancers, cancers of the retina such as retinoblastoma, cancers of the esophagus, gastric cancers, multiple myeloma, ovarian cancer, uterine cancer, thyroid cancer, testicular cancer, endometrial cancer, melanoma, bladder cancer, prostate cancer, pancreatic cancer, sarcomas, Wilms' tumor, cervical cancer, head and neck cancer, skin cancers, nasopharyngeal carcinoma, liposarcoma, epithelial carcinoma, renal cell carcinoma, gallbladder adenocarcinoma, parotid adenocarcinoma, endometrial sarcoma, multidrug resistant cancers; and proliferative diseases and conditions, such as neovascularization associated with tumor angiogenesis.
The term “candidate variant,” “called variant,” or “putative variant” refers to one or more detected nucleotide variants of a nucleotide sequence, for example, at a position in the genome that is determined to be mutated. Generally, a nucleotide base is deemed a called variant based on the presence of an alternative allele on sequence reads obtained from a sample, where the sequence reads each cross over the position in the genome. The source of a candidate variant may initially be unknown or uncertain. During processing, candidate variants may be associated with an expected source such as genomic DNA (e.g., blood-derived) or cells impacted by cancer (e.g., tumor-derived). Additionally, candidate variants may be called as true positives. A variant of interest is a particular variant of a genetic sequence that is to be measured, qualified, quantified, or detected. In some implementations, a variant of interest is a variant known or suspected to be associated with a condition, such as a cancer, a tumor, or a genetic disorder.
The term “cell free nucleic acid,” “cell free DNA,” or “cfDNA” refers to nucleic acid fragments that circulate in an individual's body (e.g., bloodstream) and originate from one or more healthy cells and/or from one or more cancer cells. Additionally cfDNA may come from other sources such as viruses, fetuses, etc.
The term “circulating tumor DNA” or “ctDNA” refers to nucleic acid fragments that originate from tumor cells or other types of cancer cells, which may be released into an individual's bloodstream as a result of biological processes such as apoptosis or necrosis of dying cells, or may be actively released by viable tumor cells.
As used herein, the terms “comprising,” “comprise” or “comprised,” and variations thereof, in reference to defined or described elements of an item, composition, apparatus, method, process, system, etc. are meant to be inclusive or open ended, permitting additional elements, thereby indicating that the defined or described item, composition, apparatus, method, process, system, etc. includes those specified elements—or, as appropriate, equivalents thereof—and that other elements can be included and still fall within the scope/definition of the defined item, composition, apparatus, method, process, system, etc.
“Diagnostic” or “diagnosed” means identifying the presence or nature of a pathologic condition. Diagnostic methods differ in their sensitivity and specificity. The “sensitivity” of a diagnostic assay is the percentage of diseased individuals who test positive (percent of “true positives”). Diseased individuals not detected by the assay are “false negatives.” Subjects who are not diseased and who test negative in the assay, are termed “true negatives.” The “specificity” of a diagnostic assay is 1 minus the false positive rate, where the “false positive” rate is defined as the proportion of those without the disease who test positive. While a particular diagnostic method may not provide a definitive diagnosis of a condition, it suffices if the method provides a positive indication that aids in diagnosis.
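The definitions of sensitivity and specificity given above can be illustrated with a short Python sketch. The counts used are hypothetical and are not results of any study described herein.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of diseased individuals who test positive."""
    diseased = true_positives + false_negatives
    if diseased == 0:
        raise ValueError("no diseased individuals")
    return true_positives / diseased


def specificity(true_negatives, false_positives):
    """One minus the false positive rate among non-diseased individuals."""
    non_diseased = true_negatives + false_positives
    if non_diseased == 0:
        raise ValueError("no non-diseased individuals")
    false_positive_rate = false_positives / non_diseased
    return 1 - false_positive_rate


# Hypothetical counts for illustration only.
example_sensitivity = sensitivity(true_positives=90, false_negatives=10)
example_specificity = specificity(true_negatives=98, false_positives=2)
```

With these hypothetical counts, 90 of 100 diseased individuals test positive and 2 of 100 non-diseased individuals test positive, corresponding to a sensitivity of 90% and a specificity of 98%.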
An “effective amount” as used herein, means an amount which provides a therapeutic or prophylactic benefit.
As used herein, the terms “fragmentation profile,” “position dependent differences in fragmentation patterns,” and “differences in fragment size and coverage in a position dependent manner across the genome” are equivalent and can be used interchangeably. In some embodiments, determining a cfDNA fragmentation profile in a mammal can be used for identifying a mammal as having cancer. For example, cfDNA fragments obtained from a mammal (e.g., from a sample obtained from a mammal) can be subjected to low coverage whole-genome sequencing, and the sequenced fragments can be mapped to the genome (e.g., in non-overlapping windows) and assessed to determine a cfDNA fragmentation profile. As described herein, a cfDNA fragmentation profile of a mammal having cancer is more heterogeneous (e.g., in fragment lengths) than a cfDNA fragmentation profile of a healthy mammal (e.g., a mammal not having cancer). As such, this disclosure also provides methods and materials for assessing, monitoring, and/or treating mammals (e.g., humans) having, or suspected of having, cancer. In some embodiments, this document provides methods and materials for identifying a mammal as having cancer. For example, a sample (e.g., a blood sample) obtained from a mammal can be assessed to determine the presence and, optionally, the tissue of origin of the cancer in the mammal based, at least in part, on the cfDNA fragmentation profile of the mammal. In some embodiments, methods and materials for monitoring a mammal as having cancer are provided. 
For example, a sample (e.g., a blood sample) obtained from a mammal can be assessed to determine the presence of the cancer in the mammal based, at least in part, on the cfDNA fragmentation profile of the mammal. In some embodiments, methods and materials for identifying a mammal as having cancer, and administering one or more cancer treatments to the mammal to treat the mammal are provided. For example, a sample (e.g., a blood sample) obtained from a mammal can be assessed to determine if the mammal has cancer based, at least in part, on the cfDNA fragmentation profile of the mammal, and one or more cancer treatments can be administered to the mammal.
The term “genomic nucleic acid,” or “genomic DNA,” refers to nucleic acid including chromosomal DNA that originates from one or more healthy (e.g., non-tumor) cells. In various embodiments, genomic DNA can be extracted from a cell derived from a blood cell lineage, such as a white blood cell (WBC).
“Optional” or “optionally” means that the subsequently described event or circumstance can or cannot occur, and that the description includes instances where the event or circumstance occurs and instances where it does not.
As used in this specification and the appended claims, the term “or” is generally employed in its sense including “and/or” unless the content clearly dictates otherwise.
“Parenteral” administration of an immunogenic composition includes, e.g., subcutaneous (s.c.), intravenous (i.v.), intramuscular (i.m.), or intrasternal injection, or infusion techniques.
The terms “patient,” “individual” and “subject” are used interchangeably herein, and refer to a mammalian subject to be treated, with human patients being preferred. In some embodiments, the methods of the invention find use in experimental animals, in veterinary application, and in the development of animal models for disease, including, but not limited to, rodents including mice, rats, and hamsters, and primates.
The term “reference genome” as used herein may refer to a digital or previously identified nucleic acid sequence database, assembled as a representative example of a species or subject. Reference genomes may be assembled from the nucleic acid sequences from multiple subjects, samples or organisms and do not necessarily represent the nucleic acid makeup of a single person. Reference genomes may be used for mapping of sequencing reads from a sample to chromosomal positions. For example, a reference genome used for human subjects as well as many other organisms is found at the National Center for Biotechnology Information at ncbi.nlm.nih.gov.
The term “read segment” or “read” refers to any nucleotide sequences including sequence reads obtained from an individual and/or nucleotide sequences derived from the initial sequence read from a sample obtained from an individual.
The terms “sample,” “patient sample,” “biological sample,” and the like, encompass a variety of sample types obtained from a patient, individual, or subject and can be used in a diagnostic, prognostic and/or monitoring assay. The patient sample may be obtained from a healthy subject, a diseased patient, or a patient with lung cancer. In certain embodiments, a sample that is “provided” can be obtained by the person (or machine) conducting the assay, or it can have been obtained by another, and transferred to the person (or machine) carrying out the assay. Moreover, a sample obtained from a patient can be divided and only a portion may be used for diagnosis. Further, the sample, or a portion thereof, can be stored under conditions to maintain the sample for later analysis. The definition specifically encompasses blood and other liquid samples of biological origin (including, but not limited to, peripheral blood, serum, plasma, cord blood, amniotic fluid, cerebrospinal fluid, urine, saliva, stool and synovial fluid), solid tissue samples such as a biopsy specimen or tissue cultures or cells derived therefrom and the progeny thereof. In certain embodiments, a sample comprises cerebrospinal fluid. In a specific embodiment, a sample comprises a blood sample. In another embodiment, a sample comprises a plasma sample. In yet another embodiment, a serum sample is used. The definition of “sample” also includes samples that have been manipulated in any way after their procurement, such as by centrifugation, filtration, precipitation, dialysis, chromatography, treatment with reagents, washing, or enrichment for certain cell populations. The terms further encompass a clinical sample, and also include cells in culture, cell supernatants, tissue samples, organs, and the like. 
Samples may also comprise fresh-frozen and/or formalin-fixed, paraffin-embedded tissue blocks, such as blocks prepared from clinical or pathological biopsies, prepared for pathological analysis or study by immunohistochemistry.
The term “sequence reads” refers to nucleotide sequences read from a sample obtained from an individual. Sequence reads can be obtained through various methods known in the art.
As defined herein, a “therapeutically effective” amount of a compound or agent (i.e., an effective dosage) means an amount sufficient to produce a therapeutically (e.g., clinically) desirable result. The compositions can be administered from one or more times per day to one or more times per week; including once every other day. The skilled artisan will appreciate that certain factors can influence the dosage and timing required to effectively treat a subject, including but not limited to the severity of the disease or disorder, previous treatments, the general health and/or age of the subject, and other diseases present. Moreover, treatment of a subject with a therapeutically effective amount of the compounds of the invention can include a single treatment or a series of treatments.
As used herein, the terms “treat,” “treating,” “treatment,” and the like refer to reducing or ameliorating a disorder and/or symptoms associated therewith. It will be appreciated that, although not precluded, treating a disorder or condition does not require that the disorder, condition or symptoms associated therewith be completely eliminated.
Genes: All genes, gene names, and gene products disclosed herein are intended to correspond to homologs from any species for which the compositions and methods disclosed herein are applicable. It is understood that when a gene or gene product from a particular species is disclosed, this disclosure is intended to be exemplary only, and is not to be interpreted as a limitation unless the context in which it appears clearly indicates otherwise. Thus, for example, the genes or gene products disclosed herein are intended to encompass homologous and/or orthologous genes and gene products from other species.
Ranges: throughout this disclosure, various aspects of the invention can be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 2.7, 3, 4, 5, 5.3, and 6. This applies regardless of the breadth of the range.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawings will be provided by the Office upon request and payment of the necessary fee.
There is an urgent unmet clinical need for the development of non-invasive approaches to improve standard of care cancer screening that can increase accessibility among high-risk individuals and ultimately the general population. Biomarker development for the early detection of lung cancer has potential clinical applications in screening as well as for discriminating malignancy in round opacities identified as nodules on chest imaging studies8. Investigation of proteins9-11, autoantibodies12, gene expression profiles13 and microRNAs14 in the blood or airway epithelium have yielded promising biomarker candidates for early detection of lung cancer although some may be affected by age as well as systemic inflammation induced by prolonged exposure to smoking and other conditions, and none are yet approved for clinical use14.
The rapid technological and analytical advancements in liquid biopsy analyses have identified cancer-related features in the cfDNA compartment of blood and provided a new avenue for early detection of cancer. Mutations in circulating tumor DNA (ctDNA) can be directly detected in early stage lung cancer patients without prior knowledge of these alterations in tumors15-18. However, given the relatively small number of sequence alterations that can be assessed by targeted high coverage sequencing, many individuals with cancer may be missed by such approaches and may also require sequencing of white blood cells (WBCs) to eliminate changes that result from clonal hematopoiesis16,17,19. To increase the sensitivity of detection of early stage cancers, a genome-wide approach was developed using machine learning for analysis of cfDNA fragmentation profiles called DELFI (DNA evaluation of fragments for early interception)20. This multi-feature analysis permits evaluation of millions of cfDNA fragments simultaneously, increasing number of tumor-derived epigenomic and genomic changes that can be detected. In this study, the methodology was improved and applied to or lung cancer detection in a prospectively collected diagnostic cohort comprising patients with lung cancer as well as non-cancer individuals. It is also disclosed herein, the evaluation of the combining this methodology with plasma protein biomarkers and blood cell counts, thereby examining genomic, epigenomic, protein, and cellular features for early cancer detection. Through this effort, a clinical framework is provided by which a non-invasive liquid biopsy approach could be incorporated in the clinic, combining the DELFI with other markers and low dose helical computed tomography (LDCT) for early lung cancer detection.
DNA Evaluation of Fragments for early Interception (DELFI) was previously developed (Cristiano S, Leal A, Phallen J, et al. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 2019; 570:385-9, incorporated herein in its entirety) and used to evaluate genome-wide fragmentation patterns of cfDNA of 236 patients with breast, colorectal, lung, ovarian, pancreatic, gastric, or bile duct cancers as well as 245 healthy individuals. These analyses revealed that cfDNA profiles of healthy individuals reflected nucleosomal fragmentation patterns of white blood cells, while patients with cancer had altered fragmentation profiles. DELFI had sensitivities of detection ranging from 57% to >99% among the seven cancer types at 98% specificity and identified the tissue of origin of the cancers to a limited number of sites in 75% of cases. Assessing cfDNA (e.g., using DELFI) can provide a screening approach for early detection of cancer, which can increase the chance for successful treatment of a patient having cancer. Assessing cfDNA (e.g., using DELFI) can also provide an approach for monitoring cancer, which can increase the chance for successful treatment and improved outcome of a patient having cancer. In addition, a cfDNA fragmentation profile can be obtained from limited amounts of cfDNA and using inexpensive reagents and/or instruments.
Accordingly, in certain embodiments, a method of diagnosing cancer in a subject comprises extracting cell-free DNA (cfDNA) from the subject's biological sample; generating genomic libraries from the extracted cfDNA and whole genome sequencing of cfDNA fragments; mapping of the cfDNA fragments to a genomic origin, evaluating fragment length, and obtaining genome-wide fragmentation profiles for each sample; identifying protein biomarkers of the subject; and comparing the subject's cfDNA fragmentation profile and protein biomarkers with those of normal reference non-cancer subjects. In certain embodiments, the cancer is lung cancer.
In certain embodiments, the method further comprises subjecting the subject to low dose helical computed tomography (LDCT). In certain embodiments, the method further comprises comparing clinical data between the subject diagnosed as having lung cancer and normal non-cancer subjects. In certain embodiments, the cfDNA fragment mean length and profiles are similar among non-cancer individuals. In certain embodiments, the cfDNA fragment profiles of cancer subjects vary. In certain embodiments, the serum levels of one or more tumor antigens, cytokines or proteins are measured.
In certain embodiments, a DELFI score is generated, wherein a principal component analysis is incorporated into a machine learning predictive model to generate a score for each subject as an average over cross-validation repeats (DELFI score(s)). In certain embodiments, the DELFI scores for non-cancer individuals are less than about 0.3. In certain embodiments, the DELFI scores for stage I cancer are from about 0.3 to less than 0.5. In certain embodiments, the DELFI scores for stage II cancer are from about 0.5 to less than 0.8. In certain embodiments, the DELFI scores for stage III cancer are from about 0.8 to less than 0.99. In certain embodiments, the DELFI scores for stage IV cancer are about 0.99 or greater. In certain embodiments, the DELFI score for stage I cancer is about 0.35. In certain embodiments, the DELFI score for stage II cancer is about 0.75. In certain embodiments, the DELFI score for stage III cancer is about 0.9. In certain embodiments, the DELFI score for stage IV cancer is about 0.99.
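As a minimal illustrative sketch of the score handling described above, the following Python code averages per-repeat model outputs into a single score and looks up which of the stated approximate ranges that score falls in. The function names, the band labels, and the treatment of the "about" boundaries as exact cut-offs are assumptions made for illustration only; the predictive model itself is not shown.

```python
from statistics import mean

# Upper bounds of the approximate score ranges stated above.
# Treating "about" values as exact cut-offs is an assumption of this sketch.
BANDS = [
    (0.3, "non-cancer range"),
    (0.5, "stage I range"),
    (0.8, "stage II range"),
    (0.99, "stage III range"),
]


def delfi_score(repeat_scores):
    """Average the model outputs from cross-validation repeats for one subject."""
    return mean(repeat_scores)


def score_band(score):
    """Return the label of the first range whose upper bound exceeds the score."""
    for upper, label in BANDS:
        if score < upper:
            return label
    return "stage IV range"


# Hypothetical per-repeat outputs for a single subject.
subject_score = delfi_score([0.30, 0.36, 0.39])
print(round(subject_score, 2), score_band(subject_score))
```

Such a lookup only restates the disclosed ranges; it is not a staging procedure.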
cfDNA Fragmentation Profiles: A cfDNA fragmentation profile can include one or more cfDNA fragmentation patterns. A cfDNA fragmentation pattern can include any appropriate cfDNA fragmentation pattern. Examples of cfDNA fragmentation patterns include, without limitation, median fragment size, fragment size distribution, ratio of small cfDNA fragments to large cfDNA fragments, and the coverage of cfDNA fragments. In some embodiments, a cfDNA fragmentation pattern includes two or more (e.g., two, three, or four) of median fragment size, fragment size distribution, ratio of small cfDNA fragments to large cfDNA fragments, and the coverage of cfDNA fragments. In some embodiments, a cfDNA fragmentation profile can be a genome-wide cfDNA profile (e.g., a genome-wide cfDNA profile in windows across the genome). In some embodiments, a cfDNA fragmentation profile can be a targeted region profile. A targeted region can be any appropriate portion of the genome (e.g., a chromosomal region). Examples of chromosomal regions for which a cfDNA fragmentation profile can be determined as described herein include, without limitation, a portion of a chromosome (e.g., a portion of 2q, 4p, 5p, 6q, 7p, 8q, 9q, 10q, 11q, 12q, and/or 14q) and a chromosomal arm (e.g., a chromosomal arm of 8q, 13q, 11q, and/or 3p). In some embodiments, a cfDNA fragmentation profile can include two or more targeted region profiles.
In some embodiments, a cfDNA fragmentation profile can be used to identify changes (e.g., alterations) in cfDNA fragment lengths. An alteration can be a genome-wide alteration or an alteration in one or more targeted regions/loci. A target region can be any region containing one or more cancer-specific alterations. In some embodiments, a cfDNA fragmentation profile can be used to identify (e.g., simultaneously identify) from about 10 alterations to about 500 alterations (e.g., from about 25 to about 500, from about 50 to about 500, from about 100 to about 500, from about 200 to about 500, from about 300 to about 500, from about 10 to about 400, from about 10 to about 300, from about 10 to about 200, from about 10 to about 100, from about 10 to about 50, from about 20 to about 400, from about 30 to about 300, from about 40 to about 200, from about 50 to about 100, from about 20 to about 100, from about 25 to about 75, from about 50 to about 250, or from about 100 to about 200, alterations).
In some embodiments, a cfDNA fragmentation profile can be used to detect tumor-derived DNA. For example, a cfDNA fragmentation profile can be used to detect tumor-derived DNA by comparing a cfDNA fragmentation profile of a mammal having, or suspected of having, cancer to a reference cfDNA fragmentation profile (e.g., a cfDNA fragmentation profile of a healthy mammal and/or a nucleosomal DNA fragmentation profile of healthy cells from the mammal having, or suspected of having, cancer). In some embodiments, a reference cfDNA fragmentation profile is a previously generated profile from a healthy mammal. For example, methods provided herein can be used to determine a reference cfDNA fragmentation profile in a healthy mammal, and that reference cfDNA fragmentation profile can be stored (e.g., in a computer or other electronic storage medium) for future comparison to a test cfDNA fragmentation profile in a mammal having, or suspected of having, cancer. In some embodiments, a reference cfDNA fragmentation profile (e.g., a stored cfDNA fragmentation profile) of a healthy mammal is determined over the whole genome. In some embodiments, a reference cfDNA fragmentation profile (e.g., a stored cfDNA fragmentation profile) of a healthy mammal is determined over a subgenomic interval.
In some embodiments, a cfDNA fragmentation profile can be used to identify a mammal (e.g., a human) as having cancer (e.g., a colorectal cancer, a lung cancer, a breast cancer, a gastric cancer, a pancreatic cancer, a bile duct cancer, and/or an ovarian cancer).
A cfDNA fragmentation profile can include a cfDNA fragment size pattern. cfDNA fragments can be any appropriate size. For example, a cfDNA fragment can be from about 50 base pairs (bp) to about 400 bp in length. As described herein, a mammal having cancer can have a cfDNA fragment size pattern that contains a shorter median cfDNA fragment size than the median cfDNA fragment size in a healthy mammal. A healthy mammal (e.g., a mammal not having cancer) can have cfDNA fragment sizes having a median cfDNA fragment size from about 166.6 bp to about 167.2 bp (e.g., about 166.9 bp). In some embodiments, a mammal having cancer can have cfDNA fragment sizes that are, on average, about 1.28 bp to about 2.49 bp (e.g., about 1.88 bp) shorter than cfDNA fragment sizes in a healthy mammal. For example, a mammal having cancer can have cfDNA fragment sizes having a median cfDNA fragment size of about 164.11 bp to about 165.92 bp (e.g., about 165.02 bp).
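The median fragment size comparison described above can be sketched in Python as follows. This is an illustrative example only: the function names are hypothetical, the 50 bp to 400 bp bounds are applied as a hard filter by assumption, and the input lengths are made-up values rather than measured data.

```python
from statistics import median


def median_fragment_size(lengths_bp, min_bp=50, max_bp=400):
    """Median length of the fragments that fall within the size range considered."""
    kept = [n for n in lengths_bp if min_bp <= n <= max_bp]
    if not kept:
        raise ValueError("no fragments within the requested size range")
    return median(kept)


def median_shift(sample_lengths, reference_lengths):
    """Positive values mean the sample median is shorter than the reference median."""
    return median_fragment_size(reference_lengths) - median_fragment_size(sample_lengths)


# Made-up fragment lengths (bp) for illustration.
reference = [165, 167, 169]
sample = [160, 164, 166]
print(median_shift(sample, reference))
```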
A cfDNA fragmentation profile can include a cfDNA fragment size distribution. As described herein, a mammal having cancer can have a cfDNA fragment size distribution that is more variable than a cfDNA fragment size distribution in a healthy mammal. In some embodiments, a size distribution can be within a targeted region. A healthy mammal (e.g., a mammal not having cancer) can have a targeted region cfDNA fragment size distribution of about 1 or less than about 1. In some embodiments, a mammal having cancer can have a targeted region cfDNA fragment size distribution that is longer (e.g., 10, 15, 20, 25, 30, 35, 40, 45, 50 or more bp longer, or any number of base pairs between these numbers) than a targeted region cfDNA fragment size distribution in a healthy mammal. In some embodiments, a mammal having cancer can have a targeted region cfDNA fragment size distribution that is shorter (e.g., 10, 15, 20, 25, 30, 35, 40, 45, 50 or more bp shorter, or any number of base pairs between these numbers) than a targeted region cfDNA fragment size distribution in a healthy mammal. In some embodiments, a mammal having cancer can have a targeted region cfDNA fragment size distribution that is about 47 bp shorter to about 30 bp longer than a targeted region cfDNA fragment size distribution in a healthy mammal. In some embodiments, a mammal having cancer can have a targeted region cfDNA fragment size distribution of, on average, a 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more bp difference in lengths of cfDNA fragments. For example, a mammal having cancer can have a targeted region cfDNA fragment size distribution of, on average, about a 13 bp difference in lengths of cfDNA fragments. In some embodiments, a size distribution can be a genome-wide size distribution. A healthy mammal (e.g., a mammal not having cancer) can have very similar distributions of short and long cfDNA fragments genome-wide.
In some embodiments, a mammal having cancer can have, genome-wide, one or more alterations (e.g., increases and decreases) in cfDNA fragment sizes. The one or more alterations can be in any appropriate chromosomal region of the genome. For example, an alteration can be in a portion of a chromosome. Examples of portions of chromosomes that can contain one or more alterations in cfDNA fragment sizes include, without limitation, portions of 2q, 4p, 5p, 6q, 7p, 8q, 9q, 10q, 11q, 12q, and 14q. As another example, an alteration can be across a chromosome arm (e.g., an entire chromosome arm).
A cfDNA fragmentation profile can include a ratio of small cfDNA fragments to large cfDNA fragments and a correlation of fragment ratios to reference fragment ratios. As used herein, with respect to ratios of small cfDNA fragments to large cfDNA fragments, a small cfDNA fragment can be from about 100 bp in length to about 150 bp in length. As used herein, with respect to ratios of small cfDNA fragments to large cfDNA fragments, a large cfDNA fragment can be from about 151 bp in length to 220 bp in length. As described herein, a mammal having cancer can have a correlation of fragment ratios (e.g., a correlation of cfDNA fragment ratios to reference DNA fragment ratios such as DNA fragment ratios from one or more healthy mammals) that is lower (e.g., 2-fold lower, 3-fold lower, 4-fold lower, 5-fold lower, 6-fold lower, 7-fold lower, 8-fold lower, 9-fold lower, 10-fold lower, or more) than in a healthy mammal. A healthy mammal (e.g., a mammal not having cancer) can have a correlation of fragment ratios (e.g., a correlation of cfDNA fragment ratios to reference DNA fragment ratios such as DNA fragment ratios from one or more healthy mammals) of about 1 (e.g., about 0.96). In some embodiments, a mammal having cancer can have a correlation of fragment ratios (e.g., a correlation of cfDNA fragment ratios to reference DNA fragment ratios such as DNA fragment ratios from one or more healthy mammals) that is, on average, about 0.19 to about 0.30 (e.g., about 0.25) lower than a correlation of fragment ratios (e.g., a correlation of cfDNA fragment ratios to reference DNA fragment ratios such as DNA fragment ratios from one or more healthy mammals) in a healthy mammal.
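One way to illustrate the ratio and correlation described above is the Python sketch below. It counts small (100 bp to 150 bp) and large (151 bp to 220 bp) fragments, forms their ratio, and computes a Pearson correlation between per-window ratios of a sample and of a reference. The use of Pearson correlation, the function names, and the example values are assumptions for illustration; the disclosure does not specify the correlation measure here.

```python
from math import sqrt

SHORT_BP = (100, 150)  # small fragment range stated above
LONG_BP = (151, 220)   # large fragment range stated above


def short_long_ratio(lengths_bp):
    """Ratio of the count of small fragments to the count of large fragments."""
    short = sum(1 for n in lengths_bp if SHORT_BP[0] <= n <= SHORT_BP[1])
    long_ = sum(1 for n in lengths_bp if LONG_BP[0] <= n <= LONG_BP[1])
    if long_ == 0:
        raise ValueError("no large fragments; ratio is undefined")
    return short / long_


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of per-window ratios."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


# Made-up per-window ratios for a sample and a healthy reference.
sample_ratios = [0.20, 0.25, 0.31, 0.22]
reference_ratios = [0.21, 0.24, 0.30, 0.23]
print(pearson(sample_ratios, reference_ratios))
```

A correlation near 1 would correspond to a profile resembling the reference, whereas a lower value would correspond to an altered profile.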
A cfDNA fragmentation profile can include coverage of all fragments. Coverage of all fragments can include windows (e.g., non-overlapping windows) of coverage. In some embodiments, coverage of all fragments can include windows of small fragments (e.g., fragments from about 100 bp to about 150 bp in length). In some embodiments, coverage of all fragments can include windows of large fragments (e.g., fragments from about 151 bp to about 220 bp in length).
A cfDNA fragmentation profile can be obtained using any appropriate method. In some embodiments, cfDNA from a mammal (e.g., a mammal having, or suspected of having, cancer) can be processed into sequencing libraries which can be subjected to whole genome sequencing (e.g., low-coverage whole genome sequencing), mapped to the genome, and analyzed to determine cfDNA fragment lengths. Mapped sequences can be analyzed in non-overlapping windows covering the genome. Windows can be any appropriate size. For example, windows can be from thousands to millions of bases in length. As one non-limiting example, a window can be about 5 megabases (Mb) long. Any appropriate number of windows can be mapped. For example, tens to thousands of windows can be mapped in the genome. For example, hundreds to thousands of windows can be mapped in the genome. A cfDNA fragmentation profile can be determined within each window.
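The windowed analysis described above can be sketched in Python as follows, assuming mapped fragments are already available as (chromosome, start position, length) tuples. The 5 Mb window size follows the non-limiting example above; the input format, the function name, and the choice to assign a fragment to the window containing its start position are assumptions of this sketch.

```python
from collections import defaultdict

WINDOW_BP = 5_000_000  # about 5 Mb, per the non-limiting example above


def window_profile(fragments, window_bp=WINDOW_BP):
    """Count small and large fragments in non-overlapping genomic windows.

    fragments: iterable of (chrom, start, length_bp) tuples (assumed format).
    Returns {(chrom, window_index): {"short": count, "long": count}}.
    """
    profile = defaultdict(lambda: {"short": 0, "long": 0})
    for chrom, start, length_bp in fragments:
        # Assumption: a fragment belongs to the window containing its start.
        key = (chrom, start // window_bp)
        if 100 <= length_bp <= 150:
            profile[key]["short"] += 1
        elif 151 <= length_bp <= 220:
            profile[key]["long"] += 1
        # Fragments outside both size ranges are ignored in this sketch.
    return dict(profile)


# Made-up mapped fragments for illustration.
fragments = [
    ("chr1", 1_000, 120),
    ("chr1", 4_999_999, 170),
    ("chr1", 5_000_000, 140),
    ("chr2", 12_000_000, 300),
]
print(window_profile(fragments))
```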
In some embodiments, methods and materials described herein also can include machine learning. For example, machine learning can be used for identifying an altered fragmentation profile (e.g., using coverage of cfDNA fragments, fragment size of cfDNA fragments, coverage of chromosomes, and mtDNA).
In certain embodiments, detection of one or more biomarkers from patients is combined with DELFI as described in detail in the examples section which follows. In certain embodiments, the serum levels of one or more tumor antigens, cytokines or proteins are measured and compared.
As used herein, the terms “comparing” or “comparison” refer to making an assessment of how the proportion, level or cellular localization of one or more biomarkers in a sample from a patient relates to the proportion, level or cellular localization of the corresponding one or more biomarkers in a standard or control sample. For example, “comparing” may refer to assessing whether the proportion, level, or cellular localization of one or more biomarkers in a sample from a patient is the same as, more or less than, or different from the proportion, level, or cellular localization of the corresponding one or more biomarkers in a standard or control sample. More specifically, the term may refer to assessing whether the proportion, level, or cellular localization of one or more biomarkers in a sample from a patient is the same as, more or less than, different from or otherwise corresponds (or not) to the proportion, level, or cellular localization of predefined biomarker levels/ratios that correspond to, for example, a patient having lung cancer, not having lung cancer, is responding to treatment for myocardial injury, is not responding to treatment for myocardial injury, is/is not likely to respond to a particular myocardial injury treatment, or having/not having another disease or condition. In a specific embodiment, the term “comparing” refers to assessing whether the level of one or more biomarkers of the present invention in a sample from a patient is the same as, more or less than, different from or otherwise corresponds (or not) to levels/ratios of the same biomarkers in a control sample (e.g., predefined levels/ratios that correlate to healthy individuals, lung cancer levels/ratios, etc.).
In another embodiment, the terms “comparing” or “comparison” refer to making an assessment of how the proportion, level or cellular localization of one or more biomarkers in a sample from a patient relates to the proportion, level or cellular localization of another biomarker in the same sample. For example, a ratio of one biomarker to another from the same patient sample can be compared. In another embodiment, a level of one biomarker in a sample (e.g., a post-translationally modified biomarker protein) can be compared to the level of the same biomarker (e.g., unmodified biomarker protein) in the sample. Ratios of modified:unmodified biomarker proteins can be compared to other protein ratios in the same sample or to predefined reference or control ratios.
In certain embodiments in which the relationship of the biomarkers is described in terms of a ratio, the ratio can include 1-fold, 2-, 3-, 4-, 5-, 6-, 7-, 8-, 9-, 10-, 11-, 12-, 13-, 14-, 15-, 16-, 17-, 18-, 19-, 20-, 21-, 22-, 23-, 24-, 25-, 26-, 27-, 28-, 29-, 30-, 31-, 32-, 33-, 34-, 35-, 36-, 37-, 38-, 39-, 40-, 41-, 42-, 43-, 44-, 45-, 46-, 47-, 48-, 49-, 50-, 51-, 52-, 53-, 54-, 55-, 56-, 57-, 58-, 59-, 60-, 61-, 62-, 63-, 64-, 65-, 66-, 67-, 68-, 69-, 70-, 71-, 72-, 73-, 74-, 75-, 76-, 77-, 78-, 79-, 80-, 81-, 82-, 83-, 84-, 85-, 86-, 87-, 88-, 89-, 90-, 91-, 92-, 93-, 94-, 95-, 96-, 97-, 98-, 99-, 100-fold or more difference (higher or lower). Alternatively, the difference can include 0.9-fold, 0.8-fold, 0.7-fold, 0.6-fold, 0.5-fold, 0.4-fold, 0.3-fold, 0.2-fold, and 0.1-fold (higher or lower) depending on context. The foregoing can also be expressed in terms of a range (e.g., 1-5 fold/times higher or lower) or a threshold (e.g., at least 2-fold/times higher or lower).
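As a short illustrative sketch of expressing a biomarker comparison as a fold difference with a threshold, the Python code below computes how many fold higher or lower a sample level is relative to a reference level. The function names, the example levels, and the default 2-fold threshold are hypothetical and chosen only to mirror the "at least 2-fold" example above.

```python
def fold_difference(sample_level, reference_level):
    """Return (fold, direction) for a sample level relative to a reference level."""
    if sample_level <= 0 or reference_level <= 0:
        raise ValueError("levels must be positive to form a ratio")
    ratio = sample_level / reference_level
    if ratio == 1:
        return 1.0, "same"
    if ratio > 1:
        return ratio, "higher"
    return 1 / ratio, "lower"


def meets_threshold(sample_level, reference_level, min_fold=2.0):
    """True when the difference is at least min_fold in either direction."""
    fold, _ = fold_difference(sample_level, reference_level)
    return fold >= min_fold


# Made-up biomarker levels in arbitrary units.
print(fold_difference(10.0, 5.0))
print(meets_threshold(6.0, 5.0))
```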
In other embodiments, a particular set or pattern of the amounts of one or more biomarkers may be correlated to a patient being unaffected (i.e., indicates a patient does not have cancer, e.g., lung cancer). In certain embodiments, “indicating,” or “correlating,” as used according to the present disclosure, may be by any linear or non-linear method of quantifying the relationship between levels/ratios of biomarkers to a standard, control or comparative value for the assessment of the diagnosis, prediction of cancer or cancer progression, assessment of efficacy of clinical treatment, identification of a patient that may respond to a particular treatment regime or pharmaceutical agent, monitoring of the progress of treatment, and, in the context of a screening assay, for the identification of anti-cancer therapeutics.
In certain embodiments, the biomarkers detected are tumor antigens. In certain embodiments, the one or more tumor antigens comprise: carcinoembryonic antigen (CEA), CA19-9, CA 125, tissue polypeptide antigen (TPA), CYFRA-21-1, neuron-specific enolase, progastrin-releasing peptide (ProGRP), plasma kallikrein B1 (KLKB1), serum amyloid A, haptoglobin-alpha-2, ADAM-17, osteoprotegerin, pentraxin 3, follistatin, tumor necrosis factor receptor superfamily member 1A or combinations thereof.
In certain embodiments, the one or more proteins comprise C-reactive protein (CRP), Chitinase-3-like protein 1 (YKL-40/CHI3L1) or fragments thereof.
Other tumor antigens include (note, the cancer indications indicated represent non-limiting examples): aminopeptidase N (CD13), annexin Al, B7-H3 (CD276, various cancers), CA125 (ovarian cancers), CA15-3 (carcinomas), CA19-9 (carcinomas), L6 (carcinomas), Lewis Y (carcinomas), Lewis X (carcinomas), alpha fetoprotein (carcinomas), CA242 (colorectal cancers), placental alkaline phosphatase (carcinomas), prostate specific antigen (prostate), prostatic acid phosphatase (prostate), epidermal growth factor (carcinomas), CD2 (Hodgkin's disease, NHL lymphoma, multiple myeloma), CD3 epsilon (T cell lymphoma, lung, breast, gastric, ovarian cancers, autoimmune diseases, malignant ascites), CD19 (B cell malignancies), CD20 (non-Hodgkin's lymphoma, B-cell neoplasmas, autoimmune diseases), CD21 (B-cell lymphoma), CD22 (leukemia, lymphoma, multiple myeloma, SLE), CD30 (Hodgkin's lymphoma), CD33 (leukemia, autoimmune diseases), CD38 (multiple myeloma), CD40 (lymphoma, multiple myeloma, leukemia (CLL)), CD51 (metastatic melanoma, sarcoma), CD52 (leukemia), CD56 (small cell lung cancers, ovarian cancer, Merkel cell carcinoma, and the liquid tumor, multiple myeloma), CD66e (carcinomas), CD70 (metastatic renal cell carcinoma and non-Hodgkin lymphoma), CD74 (multiple myeloma), CD80 (lymphoma), CD98 (carcinomas), CD123 (leukemia), mucin (carcinomas), CD221 (solid tumors), CD227 (breast, ovarian cancers), CD262 (NSCLC and other cancers), CD309 (ovarian cancers), CD326 (solid tumors), CEACAM3 (colorectal, gastric cancers), CEACAM5 (CEA, CD66e) (breast, colorectal and lung cancers), DLL4 (A-like-4), EGFR (various cancers), CTLA4 (melanoma), CXCR4 (CD 184, heme-oncology, solid tumors), Endoglin (CD 105, solid tumors), EPCAM (epithelial cell adhesion molecule, bladder, head, neck, colon, NHL prostate, and ovarian cancers), ERBB2 (lung, breast, prostate cancers), FCGR1 (autoimmune diseases), FOLR (folate receptor, ovarian cancers), FGFR (carcinomas), GD2 ganglioside (carcinomas), G-28 (a cell 
surface antigen glycolipid, melanoma), GD3 idiotype (carcinomas), heat shock proteins (carcinomas), HER1 (lung, stomach cancers), HER2 (breast, lung and ovarian cancers), HLA-DR10 (NHL), HLA-DRB (NHL, B cell leukemia), human chorionic gonadotropin (carcinomas), IGF1R (solid tumors, blood cancers), IL-2 receptor (T-cell leukemia and lymphomas), IL-6R (multiple myeloma, RA, Castleman's disease, IL6 dependent tumors), integrins (αvβ3, α5β1, α6β4, α11β3, α5β5, αvβ5, for various cancers), MAGE-1 (carcinomas), MAGE-2 (carcinomas), MAGE-3 (carcinomas), MAGE 4 (carcinomas), anti-transferrin receptor (carcinomas), p97 (melanoma), MS4A1 (membrane-spanning 4-domains subfamily A member 1, Non-Hodgkin's B cell lymphoma, leukemia), MUC1 (breast, ovarian, cervix, bronchus and gastrointestinal cancer), MUC16 (CA125) (ovarian cancers), CEA (colorectal cancer), gp100 (melanoma), MARTI (melanoma), MPG (melanoma), MS4A1 (membrane-spanning 4-domains subfamily A, small cell lung cancers, NHL), nucleolin, Neu oncogene product (carcinomas), P21 (carcinomas), nectin-4 (carcinomas), paratope of anti-(N-glycolylneuraminic acid, breast, melanoma cancers), PLAP-like testicular alkaline phosphatase (ovarian, testicular cancers), PSMA (prostate tumors), PSA (prostate), ROB04, TAG 72 (tumor associated glycoprotein 72, AML, gastric, colorectal, ovarian cancers), T cell transmembrane protein (cancers), Tie (CD202b), tissue factor, TNFRSF10B (tumor necrosis factor receptor superfamily member 10B, carcinomas), TNFRSF13B (tumor necrosis factor receptor superfamily member 13B, multiple myeloma, NHL, other cancers, RA and SLE), TPBG (trophoblast glycoprotein, renal cell carcinoma), TRAIL-R1 (tumor necrosis apoptosis inducing ligand receptor 1, lymphoma, NHL, colorectal, lung cancers), VCAM-1 (CD106, Melanoma), VEGF, VEGF-A, VEGF-2 (CD309) (various cancers). Some other tumor associated antigens have been reviewed (Gerber, et al, mAbs 2009 1:247-253; Novellino et al, Cancer Immunol Immunother. 
2005 54:187-207, Franke, et al, Cancer Biother Radiopharm. 2000, 15:459-76, Guo, et al., Adv Cancer Res. 2013; 119: 421-475, Parmiani et al. J Immunol. 2007 178:1975-9). Examples of these antigens include Cluster of Differentiations (CD4, CD5, CD6, CD7, CD8, CD9, CD10, CD11a, CD11b, CD11c, CD12w, CD14, CD15, CD16, CDwl7, CD18, CD21, CD23, CD24, CD25, CD26, CD27, CD28, CD29, CD31, CD32, CD34, CD35, CD36, CD37, CD41, CD42, CD43, CD44, CD45, CD46, CD47, CD48, CD49b, CD49c, CD53, CD54, CD55, CD58, CD59, CD61, CD62E, CD62L, CD62P, CD63, CD68, CD69, CD71, CD72, CD79, CD81, CD82, CD83, CD86, CD87, CD88, CD89, CD90, CD91, CD95, CD96, CD100, CD103, CD105, CD106, CD109, CD117, CD120, CD127, CD133, CD134, CD135, CD138, CD141, CD142, CD143, CD144, CD147, CD151, CD152, CD154, CD156, CD158, CD163, CD166, .CD168, CD184, CDwl86, CD195, CD202 (a, b), CD209, CD235a, CD271, CD303, CD304), annexin Al, nucleolin, endoglin (CD105), ROB04, amino-peptidase N, -like-4 (DLL4), VEGFR-2 (CD309), CXCR4 (CD184), Tie2, B7-H3, WT1, MUC1, LMP2, HPV E6 E7, EGFRvIII, HER-2/neu, idiotype, MAGE A3, p53 nonmutant, NY-ESO-1, GD2, CEA, MelanA/MART1, Ras mutant, gp100, p53 mutant, proteinase3 (PR1), bcr-abl, tyrosinase, survivin, hTERT, sarcoma translocation breakpoints, EphA2, PAP, ML-IAP, AFP, EpCAM, ERG (TMPRSS2 ETS fusion gene), NA17, PAX3, ALK, androgen receptor, cyclin B 1, polysialic acid, MYCN, RhoC, TRP-2, GD3, fucosyl GMl, mesothelin, PSCA, MAGE Al, sLe(a), CYPIB I, PLACl, GM3, BORIS, Tn, GloboH, ETV6-AML, NY-BR-1, RGSS, SART3, STn, carbonic anhydrase IX, PAX5, OY-TES1, sperm protein 17, LCK, HMWMAA, AKAP-4, SSX2, XAGE 1, B7H3, legumain, Tie 2, VEGFR2, MAD-CT-1, FAP, PDGFR-β, MAD-CT-2, Notch1, ICAM1 and Fos-related antigen 1.
In certain embodiments, the methods embodied herein identify a mammal as having cancer. The methods include whole genome sequencing of cfDNA fragments; mapping of the cfDNA fragments to a genomic origin, evaluating fragment length, and obtaining genome-wide fragmentation profiles for each sample; identifying protein biomarkers of the subject; and comparing the subject's cfDNA fragmentation profile and protein biomarkers with those of normal reference non-cancer subjects. The cfDNA fragmentation profile is compared to a reference cfDNA fragmentation profile, and the mammal is identified as having cancer when the cfDNA fragmentation profile in the sample obtained from the mammal is different from the reference cfDNA fragmentation profile.
In certain embodiments, a subject is diagnosed as having cancer, e.g. early stage cancer. In certain embodiments, the type of cancer is identified and the cancer is treated by various therapeutics, including therapeutics specific for the type of cancer. The cancer treatment can be surgery, adjuvant chemotherapy, neoadjuvant chemotherapy, radiation therapy, hormone therapy, cytotoxic therapy, immunotherapy, adoptive T cell therapy, targeted therapy, or any combinations thereof. The method also can include administering to the mammal a cancer treatment (e.g., surgery, adjuvant chemotherapy, neoadjuvant chemotherapy, radiation therapy, hormone therapy, cytotoxic therapy, immunotherapy, adoptive T cell therapy, targeted therapy, or any combinations thereof). The mammal can be monitored for the presence of cancer after administration of the cancer treatment.
Cancer therapies in general also include a variety of combination therapies with both chemical and radiation based treatments. Combination chemotherapies include, for example, cisplatin (CDDP), carboplatin, procarbazine, mechlorethamine, cyclophosphamide, camptothecin, ifosfamide, melphalan, chlorambucil, busulfan, nitrosourea, dactinomycin, daunorubicin, doxorubicin, bleomycin, plicamycin, mitomycin, etoposide (VP16), tamoxifen, raloxifene, estrogen receptor binding agents, taxol, gemcitabine, navelbine, farnesyl-protein transferase inhibitors, transplatinum, 5-fluorouracil, vincristine, vinblastine and methotrexate, temozolomide (an aqueous form of DTIC), or any analog or derivative variant of the foregoing. The combination of chemotherapy with biological therapy is known as biochemotherapy. The chemotherapy may also be administered at low, continuous doses, which is known as metronomic chemotherapy.
Yet further combination chemotherapies include, for example, alkylating agents such as thiotepa and cyclosphosphamide; alkyl sulfonates such as busulfan, improsulfan and piposulfan; aziridines such as benzodopa, carboquone, meturedopa, and uredopa; ethylenimines and methylamelamines including altretamine, triethylenemelamine, trietylenephosphoramide, triethiylenethiophosphoramide and trimethylolomelamine;
acetogenins (especially bullatacin and bullatacinone); a camptothecin (including the synthetic analogue topotecan); bryostatin; callystatin; CC-1065 (including its adozelesin, carzelesin and bizelesin synthetic analogues); cryptophycins (particularly cryptophycin 1 and cryptophycin 8); dolastatin; duocarmycin (including the synthetic analogues, KW-2189 and CB1-TM1); eleutherobin; pancratistatin; a sarcodictyin; spongistatin; nitrogen mustards such as chlorambucil, chlornaphazine, cholophosphamide, estramustine, ifosfamide, mechlorethamine, mechlorethamine oxide hydrochloride, melphalan, novembichin, phenesterine, prednimustine, trofosfamide, uracil mustard; nitrosureas such as carmustine, chlorozotocin, fotemustine, lomustine, nimustine, and ranimnustine; antibiotics such as the enediyne antibiotics (e.g., calicheamicin, especially calicheamicin gammall and calicheamicin omegall; dynemicin, including dynemicin A; bisphosphonates, such as clodronate; an esperamicin; as well as neocarzinostatin chromophore and related chromoprotein enediyne antiobiotic chromophores, aclacinomysins, actinomycin, authrarnycin, azaserine, bleomycins, cactinomycin, carabicin, carminomycin, carzinophilin, chromomycinis, dactinomycin, daunorubicin, detorubicin, 6-diazo-5-oxo-L-norleucine, doxorubicin (including morpholino-doxorubicin, cyanomorpholino-doxorubicin, 2-pyrrolino-doxorubicin and deoxydoxorubicin), epirubicin, esorubicin, idarubicin, marcellomycin, mitomycins such as mitomycin C, mycophenolic acid, nogalarnycin, olivomycins, peplomycin, potfiromycin, puromycin, quelamycin, rodorubicin, streptonigrin, streptozocin, tubercidin, ubenimex, zinostatin, zorubicin; anti-metabolites such as methotrexate and 5-fluorouracil (5-FU); folic acid analogues such as denopterin, pteropterin, trimetrexate; purine analogs such as fludarabine, 6-mercaptopurine, thiamiprine, thioguanine; pyrimidine analogs such as ancitabine, azacitidine, 6-azauridine, carmofur, cytarabine, dideoxyuridine, 
doxifluridine, enocitabine, floxuridine; androgens such as calusterone, dromostanolone propionate, epitiostanol, mepitiostane, testolactone; anti-adrenals such as mitotane, trilostane; folic acid replenisher such as frolinic acid; aceglatone; aldophosphamide glycoside; aminolevulinic acid; eniluracil; amsacrine; bestrabucil; bisantrene; edatraxate; defofamine; demecolcine; diaziquone; elformithine; elliptinium acetate; an epothilone; etoglucid; gallium nitrate; hydroxyurea; lentinan; lonidainine; maytansinoids such as maytansine and ansamitocins; mitoguazone; mitoxantrone; mopidanmol; nitraerine; pentostatin; phenamet; pirarubicin; losoxantrone; podophyllinic acid; 2-ethylhydrazide; procarbazine; PSK polysaccharide complex; razoxane; rhizoxin; sizofiran; spirogermanium; tenuazonic acid; triaziquone; 2,2′,2″-trichlorotriethylamine; trichothecenes (especially T-2 toxin, verracurin A, roridin A and anguidine); urethan; vindesine; dacarbazine; mannomustine; mitobronitol; mitolactol; pipobroman; gacytosine; arabinoside (“Ara-C”); cyclophosphamide; taxoids, e.g., paclitaxel and docetaxel gemcitabine; 6-thioguanine; mercaptopurine; platinum coordination complexes such as cisplatin, oxaliplatin and carboplatin; vinblastine; platinum; etoposide (VP-16); ifosfamide; mitoxantrone; vincristine; vinorelbine; novantrone; teniposide; edatrexate; daunomycin; aminopterin; xeloda; ibandronate; irinotecan (e.g., CPT-11); topoisomerase inhibitor RFS 2000; difluorometlhylornithine (DMFO); retinoids such as retinoic acid; capecitabine; carboplatin, procarbazine, plicomycin, gemcitabien, navelbine, farnesyl-protein transferase inhibitors, transplatinum; and pharmaceutically acceptable salts, acids or derivatives of any of the above.
Immunotherapeutics, generally, rely on the use of immune effector cells and molecules to target and destroy cancer cells. The immune effector may be, for example, an antibody specific for some marker on the surface of a tumor cell. The antibody alone may serve as an effector of therapy or it may recruit other cells to actually effect cell killing. The antibody also may be conjugated to a drug or toxin (chemotherapeutic, radionuclide, ricin A chain, cholera toxin, pertussis toxin, etc.) and serve merely as a targeting agent. Alternatively, the effector may be a lymphocyte carrying a surface molecule that interacts, either directly or indirectly, with a tumor cell target. Various effector cells include cytotoxic T cells and NK cells as well as genetically engineered variants of these cell types modified to express chimeric antigen receptors.
The immunotherapy may comprise suppression of T regulatory cells (Tregs), myeloid derived suppressor cells (MDSCs) and cancer associated fibroblasts (CAFs). In some embodiments, the immunotherapy is a tumor vaccine (e.g., whole tumor cell vaccines, peptides, and recombinant tumor associated antigen vaccines), or adoptive cellular therapies (ACT) (e.g., T cells, natural killer cells, TILs, and LAK cells). The T cells may be engineered with chimeric antigen receptors (CARs) or T cell receptors (TCRs) to specific tumor antigens. As used herein, a chimeric antigen receptor (or CAR) may refer to any engineered receptor specific for an antigen of interest that, when expressed in a T cell, confers the specificity of the CAR onto the T cell. Once created using standard molecular techniques, a T cell expressing a chimeric antigen receptor may be introduced into a patient, as with a technique such as adoptive cell transfer. In some aspects, the T cells are activated CD4 and/or CD8 T cells in the individual which are characterized by γ-IFN-producing CD4 and/or CD8 T cells and/or enhanced cytolytic activity relative to prior to the administration of the combination. The CD4 and/or CD8 T cells may exhibit increased release of cytokines selected from the group consisting of IFN-γ, TNF-α and interleukins. The CD4 and/or CD8 T cells can be effector memory T cells. In certain embodiments, the CD4 and/or CD8 effector memory T cells are characterized by having the expression of CD44high CD62Llow.
The immunotherapy may be a cancer vaccine comprising one or more cancer antigens, in particular a protein or an immunogenic fragment thereof, DNA or RNA encoding said cancer antigen, in particular a protein or an immunogenic fragment thereof, cancer cell lysates, and/or protein preparations from tumor cells. As used herein, a cancer antigen is an antigenic substance present in cancer cells. In principle, any protein produced in a cancer cell that has an abnormal structure due to mutation can act as a cancer antigen. In principle, cancer antigens can be products of mutated oncogenes and tumor suppressor genes, products of other mutated genes, overexpressed or aberrantly expressed cellular proteins, cancer antigens produced by oncogenic viruses, oncofetal antigens, altered cell surface glycolipids and glycoproteins, or cell type-specific differentiation antigens. Examples of cancer antigens include the abnormal products of ras and p53 genes. Other examples include tissue differentiation antigens, mutant protein antigens, oncogenic viral antigens, cancer-testis antigens and vascular or stromal specific antigens. Tissue differentiation antigens are those that are specific to a certain type of tissue. Mutant protein antigens are likely to be much more specific to cancer cells because normal cells should not contain these proteins. Normal cells will display the normal protein antigen on their MHC molecules, whereas cancer cells will display the mutant version. Some viral proteins are implicated in forming cancer, and some viral antigens are also cancer antigens. Cancer-testis antigens are antigens expressed primarily in the germ cells of the testes, but also in fetal ovaries and the trophoblast. Some cancer cells aberrantly express these proteins and therefore present these antigens, allowing attack by T-cells specific to these antigens.
Exemplary antigens of this type are CTAG1B and MAGEA1 as well as Rindopepimut, a 14-mer intradermal injectable peptide vaccine targeted against epidermal growth factor receptor (EGFR) vIII variant. Rindopepimut is particularly suitable for treating glioblastoma when used in combination with an inhibitor of the CD95/CD95L signaling system as described herein. Also, proteins that are normally produced in very low quantities, but whose production is dramatically increased in cancer cells, may trigger an immune response. An example of such a protein is the enzyme tyrosinase, which is required for melanin production. Normally tyrosinase is produced in minute quantities but its levels are very much elevated in melanoma cells. Oncofetal antigens are another important class of cancer antigens. Examples are alphafetoprotein (AFP) and carcinoembryonic antigen (CEA). These proteins are normally produced in the early stages of embryonic development and disappear by the time the immune system is fully developed. Thus self-tolerance does not develop against these antigens. Abnormal proteins are also produced by cells infected with oncoviruses, e.g. EBV and HPV. Cells infected by these viruses contain latent viral DNA which is transcribed and the resulting protein produces an immune response. A cancer vaccine may include a peptide cancer vaccine, which in some embodiments is a personalized peptide vaccine. In some embodiments, the peptide cancer vaccine is a multivalent long peptide vaccine, a multi-peptide vaccine, a peptide cocktail vaccine, a hybrid peptide vaccine, or a peptide-pulsed dendritic cell vaccine.
The immunotherapy may be an antibody, such as part of a polyclonal antibody preparation, or may be a monoclonal antibody. The antibody may be a humanized antibody, a chimeric antibody, an antibody fragment, a bispecific antibody or a single chain antibody. An antibody as disclosed herein includes an antibody fragment, such as, but not limited to, Fab, Fab′ and F(ab′)2, Fd, single-chain Fvs (scFv), single-chain antibodies, disulfide-linked Fvs (sdfv) and fragments including either a VL or VH domain. In some aspects, the antibody or fragment thereof specifically binds epidermal growth factor receptor (EGFR1, Erb-B1), HER2/neu (Erb-B2), CD20, Vascular endothelial growth factor (VEGF), insulin-like growth factor receptor (IGF-1R), TRAIL-receptor, epithelial cell adhesion molecule, carcino-embryonic antigen, Prostate-specific membrane antigen, Mucin-1, CD30, CD33, or CD40.
Examples of monoclonal antibodies include, without limitation, trastuzumab (anti-HER2/neu antibody); Pertuzumab (anti-HER2 mAb); cetuximab (chimeric monoclonal antibody to epidermal growth factor receptor EGFR); panitumumab (anti-EGFR antibody); nimotuzumab (anti-EGFR antibody); Zalutumumab (anti-EGFR mAb); Necitumumab (anti-EGFR mAb); MDX-210 (humanized anti-HER-2 bispecific antibody); MDX-447 (humanized anti-EGF receptor bispecific antibody); Rituximab (chimeric murine/human anti-CD20 mAb); Obinutuzumab (anti-CD20 mAb); Ofatumumab (anti-CD20 mAb); Tositumumab-I131 (anti-CD20 mAb); Ibritumomab tiuxetan (anti-CD20 mAb); Bevacizumab (anti-VEGF mAb); Ramucirumab (anti-VEGFR2 mAb); Ranibizumab (anti-VEGF mAb); Aflibercept (extracellular domains of VEGFR1 and VEGFR2 fused to IgG1 Fc); AMG386 (angiopoietin-1 and -2 binding peptide fused to IgG1 Fc); Dalotuzumab (anti-IGF-1R mAb); Gemtuzumab ozogamicin (anti-CD33 mAb); Alemtuzumab (anti-Campath-1/CD52 mAb); Brentuximab vedotin (anti-CD30 mAb); Catumaxomab (bispecific mAb that targets epithelial cell adhesion molecule and CD3); Naptumomab (anti-5T4 mAb); Girentuximab (anti-Carbonic anhydrase ix); or Farletuzumab (anti-folate receptor). Other examples include antibodies such as Panorex™ (17-1A) (murine monoclonal antibody); Panorex (MAb17-1A) (chimeric murine monoclonal antibody); BEC2 (anti-idiotypic mAb, mimics the GD epitope) (with BCG); Oncolym (Lym-1 monoclonal antibody); SMART M195 Ab, humanized 13′ 1 LYM-1 (Oncolym), Ovarex (B43.13, anti-idiotypic mouse mAb); 3622W94 mAb that binds to EGP40 (17-1A) pancarcinoma antigen on adenocarcinomas; Zenapax (SMART Anti-Tac (IL-2 receptor); SMART M195 Ab, humanized Ab, humanized); NovoMAb-G2 (pancarcinoma specific Ab); TNT (chimeric mAb to histone antigens); Gliomab-H (Monoclonals-Humanized Abs); GNI-250 Mab; EMD-72000 (chimeric-EGF antagonist); LymphoCide (humanized LL2 antibody); and MDX-260 bispecific, targets GD-2, ANA Ab, SMART IDIO Ab, SMART ABL 364 Ab or ImmuRAIT-CEA. Further examples of antibodies include Zanolimumab (anti-CD4 mAb), Keliximab (anti-CD4 mAb); Ipilimumab (MDX-101; anti-CTLA-4 mAb); Tremelimumab (anti-CTLA-4 mAb); Daclizumab (anti-CD25/IL-2R mAb); Basiliximab (anti-CD25/IL-2R mAb); MDX-1106 (anti-PD1 mAb); antibody to GITR; GC1008 (anti-TGF-β antibody); metelimumab/CAT-192 (anti-TGF-β antibody); lerdelimumab/CAT-152 (anti-TGF-β antibody); 1D11 (anti-TGF-β antibody); Denosumab (anti-RANKL mAb); BMS-663513 (humanized anti-4-1BB mAb); SGN-40 (humanized anti-CD40 mAb); CP870,893 (human anti-CD40 mAb); Infliximab (chimeric anti-TNF mAb); Adalimumab (human anti-TNF mAb); Certolizumab (humanized Fab anti-TNF); Golimumab (anti-TNF); Etanercept (Extracellular domain of TNFR fused to IgG1 Fc); Belatacept (Extracellular domain of CTLA-4 fused to Fc); Abatacept (Extracellular domain of CTLA-4 fused to Fc); Belimumab (anti-B Lymphocyte stimulator); Muromonab-CD3 (anti-CD3 mAb); Otelixizumab (anti-CD3 mAb); Teplizumab (anti-CD3 mAb); Tocilizumab (anti-IL6R mAb); REGN88 (anti-IL6R mAb); Ustekinumab (anti-IL-12/23 mAb); Briakinumab (anti-IL-12/23 mAb); Natalizumab (anti-α4 integrin); Vedolizumab (anti-α4 β7 integrin mAb); T1h (anti-CD6 mAb); Epratuzumab (anti-CD22 mAb); Efalizumab (anti-CD11a mAb); and Atacicept (extracellular domain of transmembrane activator and calcium-modulating ligand interactor fused with Fc).
The rapid technological and analytical advancements in liquid biopsy analyses have identified cancer-related features in the cfDNA fragments in peripheral blood and have provided a new avenue for noninvasive detection of cancer. Mutations or methylation in circulating tumor DNA (ctDNA) can be directly detected in early stage lung cancer patients without prior knowledge of these alterations in tumors16-21. Given the relatively small number of sequence or epigenetic alterations that can be assessed by targeted high coverage sequencing, many individuals with cancer may be missed by such approaches and may also require sequencing of white blood cells (WBCs) to eliminate changes that result from clonal hematopoiesis17,18,22. To increase the sensitivity of detection of early stage cancers a genome-wide approach was developed for analysis of cfDNA fragmentation profiles called DELFI (DNA evaluation of fragments for early interception)23. This approach provides a view of the cfDNA “fragmentome”, permitting evaluation of the size distribution and frequency of millions of naturally occurring cfDNA fragments across the genome. As the cfDNA fragmentome can comprehensively represent both genomic and chromatin characteristics, it has the potential to identify a large number of tumor-derived changes in the circulation. In this study, this methodology was utilized for lung cancer detection and characterization in a prospectively collected real-world cohort comprising patients with malignant and benign pulmonary nodules as well as non-cancer individuals, including those with other clinical conditions (
Study Population Analyzed
The LUCAS diagnostic cohort is a prospectively collected cohort of 368 predominantly symptomatic patients who presented at the Department of Respiratory Medicine, Infiltrate Unit, Bispebjerg Hospital, Copenhagen, with a positive imaging finding on a chest X-ray or a chest CT. The study was conducted over 7 months from September 2012 to March 2013, and all patients had clinical follow-up until death or April 2020. All patients had blood samples collected at their first clinic visit before the possible diagnosis of lung cancer was made. Samples from 365 patients that passed quality control from genomic sequencing were included in subsequent analyses. The analyzed cohort included 158 patients with no prior, baseline or future cancers, 129 patients with baseline lung cancer, and 78 patients without cancer at the time of blood collection, but with either earlier or later cancers (
Sample Collection and Preservation
Samples for the LUCAS cohort were collected at the time of the screening visit as follows: venous peripheral blood was collected in one K2-EDTA tube and two serum gel tubes. Within two hours of blood collection, tubes were centrifuged at 2330 g at 4° C. for 10 min. After centrifugation, EDTA plasma and serum were aliquoted and stored at −80° C. for cfDNA and protein analyses, respectively.
For the validation cohort, venous peripheral blood for each individual was collected in one EDTA tube. Tubes were centrifuged at low speed (1500-3000 g) for 10-15 min within two hours from blood collection. The plasma portion from the first spin was spun a second time for 10 min. After centrifugation EDTA plasma was aliquoted and stored at −80° C. for cfDNA analyses.
Sequencing Library Preparation
Circulating cell-free DNA was isolated from 2-4 ml of plasma using the Qiagen QIAamp Circulating Nucleic Acids Kit (Qiagen GmbH), eluted in 52 μl of RNase-free water containing 0.04% sodium azide (Qiagen GmbH), and stored in LoBind tubes (Eppendorf AG) at −20° C. Concentration and quality of cfDNA were assessed using the Bioanalyzer 2100 (Agilent Technologies).
Next-generation sequencing (NGS) cfDNA libraries were prepared for WGS using 15 ng cfDNA when available, or entire purified amount when less than 15 ng. For the validation cohort available cfDNA up to 125 ng was used as input material for library preparation. In brief, genomic libraries were prepared using the NEBNext DNA Library Prep Kit for Illumina (New England Biolabs (NEB)) with four main modifications to the manufacturer's guidelines: (i) the library purification steps use the on-bead AMPure XP (Beckman Coulter) approach to minimize sample loss during elution and tube transfer steps; (ii) NEBNext End Repair, A-tailing and adaptor ligation enzyme and buffer volumes were adjusted as appropriate to accommodate the on-bead AMPure XP purification strategy; (iii) Illumina dual index adaptors were used in the ligation reaction; and (iv) cfDNA libraries were amplified with Phusion Hot Start Polymerase. All samples underwent a 4 cycle PCR amplification after the DNA ligation step.
In total, 23 batches of cfDNA library preparations were performed for the LUCAS cohort. Each batch included a combination of cancer patients and non-cancer controls. All batches included a technical replicate of nucleosomal DNA obtained from nuclease-digested human peripheral blood mononuclear cells (PBMCs) to assess sequencing consistency across batches performed on different dates. A negative library control, in which buffer TE pH 8.0 was used instead of a DNA sample, was periodically included to ensure there was no DNA contamination during the library preparation. The validation cohort was prepared in the same fashion as above and the 485 samples, including cases and controls, were spread over 33 batches.
Low Coverage Whole Genome Sequencing and Alignment
Whole-genome libraries of cancer patients and cancer-free individuals were sequenced using 100-bp paired-end runs (200 cycles) on the Illumina HiSeq2500 platform at 1-2× coverage per genome. Prior to alignment, adapter sequences were filtered from reads using the fastp software34. Sequence reads were aligned against the hg19 human reference genome using Bowtie235 and duplicate reads were removed using Sambamba36. Post-alignment, each aligned pair was converted to a genomic interval representing the sequenced DNA fragment using bedtools37. Only reads with a mapq score of at least 30 were retained. Read pairs were further filtered out if they overlapped a problematic region provided by the Duke Excluded Regions blacklist (genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeMapability; Amemiya, H. M., Kundaje, A. & Boyle, A. P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9, 9354 (2019). https://doi.org/10.1038/s41598-019-45839-z). To capture large-scale epigenetic differences in fragmentation across the genome estimable from low-coverage whole genome sequencing, the hg19 reference genome was tiled into 473 non-overlapping 5 Mb bins spanning approximately 2.4 Gb of the genome. All bins had an average GC content >=0.3 and an average mappability >=0.9.
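The filtering and binning steps described above can be illustrated with a minimal sketch. It assumes fragments are already available as (chromosome, start, end, mapq) tuples, assigns each fragment to a bin by its midpoint, and tiles the chromosome naively; the midpoint rule, the function names, and the toy excluded region are assumptions made for illustration, and the GC-content and mappability criteria for selecting the 473 bins are not reproduced here.

```python
from collections import Counter

BIN_SIZE = 5_000_000  # 5 Mb bins, as described in the text
MIN_MAPQ = 30         # minimum mapping quality retained


def overlaps(start, end, regions):
    """Return True if [start, end) overlaps any (start, end) interval in regions."""
    return any(start < r_end and r_start < end for r_start, r_end in regions)


def bin_fragments(fragments, excluded):
    """Count retained fragments per (chromosome, bin index).

    fragments: iterable of (chrom, start, end, mapq), 0-based half-open.
    excluded:  dict mapping chrom -> list of (start, end) problematic regions.
    """
    counts = Counter()
    for chrom, start, end, mapq in fragments:
        if mapq < MIN_MAPQ:
            continue  # low mapping quality
        if overlaps(start, end, excluded.get(chrom, [])):
            continue  # falls in an excluded region
        midpoint = (start + end) // 2  # assumption: assign by fragment midpoint
        counts[(chrom, midpoint // BIN_SIZE)] += 1
    return counts


# Toy input (hypothetical coordinates)
frags = [
    ("chr1", 1_000, 1_167, 42),          # kept, bin 0
    ("chr1", 5_000_100, 5_000_260, 60),  # kept, bin 1
    ("chr1", 2_000, 2_150, 12),          # dropped: mapq below 30
    ("chr1", 9_990, 10_150, 50),         # dropped: overlaps excluded region
]
excluded = {"chr1": [(10_000, 20_000)]}
bin_counts = bin_fragments(frags, excluded)
```

In practice the interval conversion would be done by the tools named in the text; the sketch only shows the order of the filters and the bin assignment.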
Whole Genome Fragment Features
Ratios of the number of short (100 to 150 bp) to long (151 to 220 bp) fragments across the 473 bins were normalized for GC-content and library size. As GC-related biases have been largely attributed to preferential amplification of fragments during PCR40, a non-parametric method was developed for fragment-level GC adjustment. For each individual in the LUCAS cohort, fragments were assigned to one of 100 possible GC strata between 0 and 1 (1 indicating a fragment with all G and C nucleotides), obtaining the total number of fragments within each GC stratum. In the same manner, a distribution of fragment counts was obtained by GC stratum for the held-out set of 54 non-cancer samples as well as the median of the 54 distributions that were referred to as a target distribution. To normalize sample-to-sample PCR biases in LUCAS, the collection of fragments in GC stratum i were assigned a weight wi such that Σi=1N
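As a hedged illustration of the fragment-level GC adjustment and the short-to-long ratio, the sketch below assigns fragments to one of 100 GC strata, weights each stratum so that its weighted count matches a target distribution, and computes a weighted ratio for one bin. The specific weight form (target count divided by observed count) is an assumption, since the weighting constraint is not fully stated above; all function names are hypothetical.

```python
def gc_stratum(seq, n_strata=100):
    """Map a fragment sequence to a GC stratum index in 0..n_strata-1."""
    gc = sum(base in "GC" for base in seq.upper()) / len(seq)
    return min(int(gc * n_strata), n_strata - 1)


def stratum_weights(sample_counts, target_counts):
    """Assumed weight form: target count / observed count for each stratum."""
    return {
        stratum: target_counts.get(stratum, 0) / observed
        for stratum, observed in sample_counts.items()
        if observed > 0
    }


def short_long_ratio(fragments, weights):
    """Weighted ratio of short (100-150 bp) to long (151-220 bp) fragments.

    fragments: iterable of (length_bp, gc_stratum) for one genomic bin.
    """
    short = sum(weights.get(s, 0.0) for n, s in fragments if 100 <= n <= 150)
    long_ = sum(weights.get(s, 0.0) for n, s in fragments if 151 <= n <= 220)
    return short / long_ if long_ else float("nan")


# Toy example with two GC strata (hypothetical counts)
fragments = [(120, 40), (140, 40), (167, 40), (180, 55)]
sample_counts = {40: 3, 55: 1}
target_counts = {40: 1.5, 55: 1.0}
weights = stratum_weights(sample_counts, target_counts)
ratio = short_long_ratio(fragments, weights)
```

The library-size normalization mentioned in the text is omitted from this sketch.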
In addition to fragmentation profiles, z-scores were also computed for chromosomal arms and a genome-wide summary of the overall cfDNA fragmentation. Z-scores for each of the 39 autosomal arms were obtained by centering and scaling the total GC-adjusted fragment count for each arm by the mean and standard deviation of the corresponding arm-specific counts in the 54 non-cancer samples used as a reference set23,41. To summarize the overall cfDNA fragmentation sizes for each sample, we computed the ratio of multinucleosomal (>=250 bp) to mononucleosomal (<250 bp) fragments.
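The arm-level z-scores and the fragment-size summary can be sketched as follows. The sketch assumes the sample standard deviation is used for scaling (the text does not say whether the sample or population form was applied) and uses toy counts; the function names are hypothetical.

```python
from statistics import mean, stdev


def arm_z_scores(sample_counts, reference_counts):
    """Center and scale each arm count by the reference mean and SD.

    sample_counts:    dict arm -> GC-adjusted fragment count for one sample.
    reference_counts: dict arm -> list of counts in the non-cancer reference set.
    """
    z = {}
    for arm, count in sample_counts.items():
        ref = reference_counts[arm]
        z[arm] = (count - mean(ref)) / stdev(ref)  # assumption: sample SD
    return z


def multi_to_mono_ratio(lengths):
    """Ratio of multinucleosomal (>=250 bp) to mononucleosomal (<250 bp) fragments."""
    multi = sum(1 for n in lengths if n >= 250)
    mono = sum(1 for n in lengths if n < 250)
    return multi / mono


# Toy reference set of three samples (the text uses 54)
reference = {"1p": [100, 110, 90], "1q": [200, 220, 180]}
z = arm_z_scores({"1p": 120, "1q": 190}, reference)
size_ratio = multi_to_mono_ratio([167, 150, 330, 180, 340])
```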
Analysis Based on Public Databases from TCGA
Copy number data from the two lung cancer cohorts in TCGA (LUAD n=518 and LUSC n=501) were retrieved using the package RTCGA v1.16.051. The tumor to normal log copy ratio values were compared to tumor type specific thresholds52 to identify genomic regions harboring copy gains and losses. The copy number status in each of the 5 Mb bins across the genome was determined by requiring a minimum coverage of 90% of the bin interval by segments harboring a gain or loss. The frequency of copy gain and loss in the genomic bins were calculated for each of the lung cancer cohorts in TCGA.
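The 90% coverage rule for assigning a copy number status to a bin can be sketched as below. The sketch assumes segments have already been called as gain, loss, or neutral by comparison to the tumor type specific thresholds (which are not given here) and that segments do not overlap one another; the function name and inputs are hypothetical.

```python
def bin_status(bin_start, bin_end, segments, min_fraction=0.9):
    """Return 'gain', 'loss', or 'neutral' for one genomic bin.

    segments: list of (start, end, call) with call in {'gain', 'loss', 'neutral'},
              assumed non-overlapping and on the same chromosome as the bin.
    """
    covered = {"gain": 0, "loss": 0}
    for start, end, call in segments:
        if call in covered:
            # base pairs of this segment that fall inside the bin
            covered[call] += max(0, min(end, bin_end) - max(start, bin_start))
    width = bin_end - bin_start
    for call, bp in covered.items():
        if bp / width >= min_fraction:
            return call
    return "neutral"


# A 5 Mb bin covered 92% by a gain segment, and one covered 80% by a loss segment
status_a = bin_status(0, 5_000_000, [(0, 4_600_000, "gain"), (4_600_000, 5_000_000, "neutral")])
status_b = bin_status(0, 5_000_000, [(0, 4_000_000, "loss"), (4_000_000, 5_000_000, "neutral")])
```

Gain and loss frequencies per bin would then follow by counting these statuses across the tumors in each cohort.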
Machine Learning and Cross-Validation Analyses
Five-fold cross-validation, repeated 10 times, was used to develop a predictive model for early and late stage cancer detection, where feature selection and model development were evaluated on four of the five folds (training set) and a fifth held-out fold was used only to assess model performance (test set). The total number of samples available for training and testing included 158 participants with no prior, baseline or future cancer and 129 patients with cancer at the time of the blood draw. Patients with a prior or future cancer (n=78) were not used for training or testing. Due to the high dimensionality of the fragmentation features relative to the number of available samples for training, a principal components analysis was performed within each training set to reduce the dimensionality of the feature space, retaining the minimum number of principal components needed to explain 90% of the variance of the fragmentation profiles between samples. In addition to the principal component features, all 39 z-scores were evaluated in a logistic regression model with a LASSO penalty. The optimized LASSO penalty of 0.0017 in the analysis was obtained by resampling using the caret R package. The DELFI score derived for each sample corresponds to the mean score across the 10 cross-validation repeats. As both the feature selection (principal components analysis) and LASSO model were evaluated only on the samples in the training set, this approach provides an unbiased estimate of model performance.
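Two small steps of this procedure lend themselves to a short sketch: choosing the minimum number of principal components that explain 90% of the variance, and averaging a sample's held-out scores across repeats. The sketch assumes the per-component variances have already been computed on the training folds; the function names and the numbers are hypothetical, and the LASSO fit itself is not shown.

```python
def min_components(explained_variance, threshold=0.90):
    """Smallest number of components whose cumulative variance share reaches threshold."""
    total = sum(explained_variance)
    running = 0.0
    ordered = sorted(explained_variance, reverse=True)
    for k, var in enumerate(ordered, start=1):
        running += var
        if running / total >= threshold:
            return k
    return len(ordered)


def mean_score(scores_by_repeat):
    """Average a sample's held-out prediction scores across cross-validation repeats."""
    return sum(scores_by_repeat) / len(scores_by_repeat)


# Toy variances: cumulative shares are 0.60, 0.85, 0.95, ...
n_pcs = min_components([6.0, 2.5, 1.0, 0.3, 0.2])
delfi_like_score = mean_score([0.80, 0.90, 0.85])
```

Computing the component count inside each training set, rather than once on all samples, is what keeps the held-out fold independent of feature selection.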
As an additional measure of model performance, a final model was obtained using all 158 non-cancer individuals and 129 patients with baseline lung cancer in the LUCAS study, and this fixed model was evaluated in an independent cohort (validation set) of non-cancer individuals (n=385) and patients with predominantly early stage cancer (n=46). Using the fixed model, a prediction score was obtained for each individual in the validation set, and each individual was classified as non-cancer or cancer using a fixed cutoff from LUCAS that provided a desired specificity of 80%. Notably, the samples from the validation cohort were entirely batch-independent from the LUCAS cohort with respect to sample collection, library preparation, and sequencing.
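One way to derive a fixed cutoff for a desired specificity is to take an empirical quantile of the scores of non-cancer individuals in the training cohort and then apply that cutoff unchanged to new samples. The sketch below is an assumption about how such a cutoff could be computed, not a statement of the procedure actually used; the scores are invented for illustration.

```python
import math


def cutoff_for_specificity(noncancer_scores, specificity=0.80):
    """Smallest observed score c such that at least `specificity` of
    non-cancer scores are <= c. Scores strictly above c are called cancer."""
    ordered = sorted(noncancer_scores)
    # round() guards against floating-point noise before taking the ceiling
    k = math.ceil(round(specificity * len(ordered), 9))
    return ordered[k - 1]


def classify(score, cutoff):
    """Apply the fixed cutoff to one prediction score."""
    return "cancer" if score > cutoff else "non-cancer"


# Hypothetical non-cancer scores from the training cohort
noncancer = [0.05, 0.10, 0.12, 0.16, 0.20, 0.21, 0.30, 0.35, 0.50, 0.70]
cutoff = cutoff_for_specificity(noncancer, 0.80)

# The cutoff is then applied, unchanged, to scores from an independent cohort
calls = [classify(s, cutoff) for s in (0.90, 0.20)]
```

With these toy scores, two of the ten non-cancer individuals exceed the cutoff, which corresponds to the requested 80% specificity in the training data.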
To assess whether clinical and serum protein markers in addition to fragmentation features could further improve prediction, a multimodal predictive model was evaluated using the repeated five-fold cross-validation approach. Fragmentation features summarized by a principal component analysis and z-scores were evaluated as described above such that both feature selection and estimation of model parameters were independent of the test set. For clinical and serum protein markers, we included age, smoking history, COPD status, and CEA. A logistic regression model with a LASSO penalty was used to evaluate the fragmentation, clinical, and protein biomarker features in each training fold.
Treatment in the LUCAS Cohort
All patients were evaluated after diagnosis for eligibility for either 1) primary surgery, 2) concomitant chemotherapy and radiotherapy with curative intent, 3) standard palliative systemic oncological treatment (with either chemotherapy or targeted therapy), or 4) best supportive care—all according to the Danish national treatment guidelines for lung cancer in 2012-13, which were in concordance with the ESMO guidelines42-44.
All patients were evaluated for possible primary surgery based on TNM-stage as well as possible co-morbidities that might prevent the possibility for anesthesia. Patients underwent primary lung surgery for a solitary lung metastasis (two colo-rectal, one testis cancer, and one breast cancer), and a subset received post-surgery adjuvant chemotherapy according to ESMO guidelines in 2012-2013. If patients were not eligible for primary surgery, they were then evaluated at a multi-disciplinary team conference for concomitant chemotherapy (platinum doublet combined with either Vinorelbine (for NSCLC) or Etoposide (for SCLC)) and radiotherapy (either 2 Gray in 33 fractions, 5 F/W or stereotactic radiotherapy 15 Gray in 3 fractions (for NSCLC) or 2.05 Gray in 45 fractions or 3 Gray in 10 fractions (for SCLC)) with curative intent. Patients with poor ECOG performance status and/or significant co-morbidities were precluded from having any type of oncological treatment and were referred for supportive care. Patients with advanced disease at the time of diagnosis were eligible for palliative chemotherapy and/or radiotherapy. Patients with an EGFR mutation were primarily treated with Gefitinib in 1st line and Erlotinib following either Gefitinib or chemotherapy, and those with ALK-translocation were treated with crizotinib. Patients from the initial palliative treatment cohort went on to receive 2nd line oncological treatment after progression of disease (typically Pemetrexed monotherapy). Only 16 patients received additional therapy after a 2nd line treatment, with one patient receiving a total of seven lines of treatment. Since the cohort is from 2012-2013, only two patients received immunotherapy (Nivolumab), in the 3rd and 7th line, respectively, in a CheckMate protocol.
Association of Clinical Covariates and Survival with the DELFI Score
Univariate analyses comparing the distribution of the DELFI score to the baseline clinical and laboratory covariates age, smoking, and serum inflammatory markers were performed using a Wilcoxon rank sum test. Additionally, the relationship of the DELFI score and cancer risk with and without the baseline covariates age, smoking, and sex was evaluated using logistic regression.
To assess whether the DELFI score was associated with prognosis, high risk lung cancer patients were categorized according to whether they were more likely to have cancer than not (DELFI score >0.5). To assess whether this categorization was associated with survival among lung cancer patients in a univariate analysis, a log rank test was used to compare survival curves. In addition to the univariate analysis, it was evaluated whether the DELFI score was independently associated with lung cancer specific survival in a multivariable Cox proportional hazards model that included histological subtypes of primary lung cancer and clinical staging as additional explanatory variables.
Genome-Wide Transcription Factor Analyses for Prediction of Histological Subtype
Gene expression values were obtained as raw counts using recount3 1.0.256 and converted to transcripts per million (TPM) using recount 1.16.1 for SCLC (n=79)57, lung adenocarcinoma (n=542) and lung squamous cell carcinoma (n=504) generated by The Cancer Genome Atlas (TCGA), and whole blood (n=755) generated by the Genotype-Tissue Expression (GTEx) project. The median TPM value was computed for 1,639 transcription factors (TFs)58 in each cancer/tissue type. TFs that were unexpressed (median TPM <1) in lung adenocarcinoma, lung squamous cell carcinoma, and whole blood were identified and then ordered from highest to lowest expression in SCLC. The top gene was ASCL1 (median TPM=101). Chromatin immunoprecipitation followed by sequencing (ChIP-seq) peaks were then obtained for ASCL1 (n=13,920 peaks) (GEO Sample accession number: GSM3704421)59. For each peak in the autosomes (n=13,693 peaks), the center of the peak was defined as position 0 and the coverage was then computed in a +/−3,000 bp window around each peak separately for 125 samples with a DELFI score of at least 0.37 (corresponding to a specificity of 85%). A small number of peaks with an average coverage of >3 across samples were excluded. The mean of the coverages at each position (−3,000 to +3,000) across all peaks was computed for each sample. For
Using fragments <200 bp, the mean fragment size was computed at each position in a +/−3,000 bp window surrounding the ASCL1 binding sites. For
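The per-position averaging around binding sites described in this section can be sketched as follows. The sketch assumes coverage for one sample is available as a mapping from genomic position to depth on a single chromosome, with missing positions treated as zero; the function name, the toy coordinates, and the small window used in the example are illustrative only.

```python
def mean_profile(coverage, centers, half_window=3000):
    """Average coverage at each offset around a set of peak centers.

    coverage: dict position -> depth for one sample (missing positions count as 0).
    centers:  list of peak-center positions, each treated as position 0.
    Returns a list of length 2 * half_window + 1, ordered from -half_window
    to +half_window.
    """
    profile = []
    for offset in range(-half_window, half_window + 1):
        depths = [coverage.get(center + offset, 0) for center in centers]
        profile.append(sum(depths) / len(depths))
    return profile


# Toy example with two peak centers and a +/-2 bp window
coverage = {98: 1, 99: 2, 100: 0, 101: 2, 102: 1, 199: 4, 200: 0, 201: 4}
profile = mean_profile(coverage, centers=[100, 200], half_window=2)
```

The same loop could aggregate mean fragment size instead of depth by substituting a position-to-mean-size mapping for the coverage input.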
Modelling of DELFI Performance in a Screening Population
To assess performance of LDCT alone and DELFI followed by LDCT in a hypothetical screening population of 100,000 individuals, we used Monte Carlo simulations to capture uncertainty of unknown parameters sensitivity, specificity, adherence, and lung cancer prevalence. Prior models of sensitivity for LUCAS alone were centered loosely on empirical estimates from the LUCAS and NLST cohorts:
We sampled $\pi \sim \mathrm{Bernoulli}(0.5)$.
For specificity, prior models were
The number of individuals screened in our simulated screening study depends on adherence to screening guidelines. Letting n denote the size of our screening study, our sampling model for n is given by
$$n \sim \mathrm{Binomial}(10^{5}, \eta)$$
$$\eta \sim \mathrm{Beta}(\alpha_n, \beta_n)$$
For LDCT alone, shape parameters αn and βn were 12 and 1887 while for DELFImulti followed by LDCT shape parameters were 80 and 2032. Conditional on the size of our screening study and draws of θ1,M and θ2,M from their respective prior distributions, we sampled the disease status, y, and screening results, x, conditional on y:
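$$y_i \sim \mathrm{Bernoulli}(\psi) \quad \text{for } i = 1, \ldots, n$$
$$\psi \sim \mathrm{Beta}(9.1, 990.9)$$
$$x_i \mid \{y_i = 1, M\} \sim \mathrm{Bernoulli}(\theta_{1,M})$$
$$x_i \mid \{y_i = 0, M\} \sim \mathrm{Bernoulli}(1 - \theta_{2,M})$$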
The informative prior for prevalence, ψ, in our hypothetical population ensures that our screening study will be comprised predominantly of individuals without cancer, but allows the true prevalence to be smaller or larger than the estimate of 0.91% from the NLST study5. The number of patients with lung cancers detected, accuracy, false positive rate, and positive predictive value were calculated from the joint distribution of x and y. We repeated the above sampling procedure 10,000 times, thereby obtaining predictive distributions for these statistics that reflect uncertainty of sensitivity, specificity, adherence, and prevalence.
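The sampling scheme above can be sketched with the Python standard library. Only the prevalence prior Beta(9.1, 990.9) is taken from the text; the adherence, sensitivity, and specificity shape parameters in the usage lines are hypothetical placeholders, as are the function names. Per-individual Bernoulli draws are replaced by equivalent binomial counts, and the example uses a small population and few repetitions so that it runs quickly.

```python
import random


def binomial(n, p, rng):
    """Draw from Binomial(n, p) as a sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


def simulate(population, adherence_ab, sens_ab, spec_ab,
             prev_ab=(9.1, 990.9), reps=200, seed=0):
    """Monte Carlo draws of screening outcomes under Beta priors.

    Each *_ab argument is a (alpha, beta) pair of Beta shape parameters.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        eta = rng.betavariate(*adherence_ab)   # adherence
        theta1 = rng.betavariate(*sens_ab)     # sensitivity
        theta2 = rng.betavariate(*spec_ab)     # specificity
        psi = rng.betavariate(*prev_ab)        # prevalence
        n = binomial(population, eta, rng)     # individuals screened
        cases = binomial(n, psi, rng)          # individuals with cancer
        true_pos = binomial(cases, theta1, rng)
        false_pos = binomial(n - cases, 1 - theta2, rng)
        positives = true_pos + false_pos
        draws.append({
            "screened": n,
            "cases": cases,
            "detected": true_pos,
            "false_positive_rate": false_pos / (n - cases) if n > cases else 0.0,
            "ppv": true_pos / positives if positives else float("nan"),
        })
    return draws


# Hypothetical shape parameters, chosen only to exercise the code
draws = simulate(population=2000, adherence_ab=(6, 94),
                 sens_ab=(90, 10), spec_ab=(80, 20), reps=50)
```

Summaries such as medians or intervals of the detected count, false positive rate, and positive predictive value can then be read off the list of draws.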
Bioinformatic and Statistical Software
All statistical analyses were performed using R version 3.6.1. After trimming of adapter sequences using fastp (0.20.0), we used Bowtie2 (2.3.0) to align paired end reads to the hg19 reference genome. PCR duplicates were removed using Sambamba (0.6.8) and the remaining aligned read pairs were converted to a bed format using Bedtools (2.29.0). The R package data.table (1.12.8) was used for manipulation of tabular data and binning fragments in 5 Mb windows along the genome. The R packages caret (6.0.84) and gbm (2.1.5) were used to implement the classification by gradient boosted trees and resampling.
Patient blood samples from a prospective diagnostic study of 365 individuals conducted over a seven-month period at Bispebjerg Hospital in Copenhagen, Denmark (LUCAS cohort), were examined. The majority of subjects in the cohort were symptomatic individuals at high risk for lung cancer (age 50-80 and smoking history >20 pack-years) (Table 1). The cohort included 323 subjects (90%) with pulmonary, non-pulmonary or constitutional symptoms, with the majority having common smoking-related symptoms such as cough or dyspnea. The remainder were asymptomatic at enrollment with an incidental chest image finding by X-ray or CT that was suspicious for lung malignancy. At the time of the patient's clinic visit, an additional chest CT or 18F-PET/CT was performed to assess the identified nodule or infiltrate (
DELFI Performance for Lung Cancer Detection
2-4 ml of plasma was isolated from each patient in the LUCAS cohort and the extracted cfDNA was examined using the DELFI approach with experimental and bioinformatic improvements. As PCR is known to affect the representation of amplified genomic fragments depending on GC content and fragment length, DELFI genome-wide fragmentation profiles were evaluated using genomic libraries created without amplification or with 4 or 12 cycles of PCR. It was found that libraries created with 4 cycles of PCR had profiles that were similar to those without any amplification, while 12 cycles led to substantial biases (
The resulting fragmentation profiles were remarkably consistent among non-cancer individuals, including those with non-malignant lung nodules (
As clinical characteristics may affect tumor biomarkers, it was first investigated whether non-malignant nodules, demographic parameters such as age or smoking history, or the presence of chronic obstructive pulmonary disease (COPD) or autoimmune diseases were associated with DELFI scores. An unbiased analysis of these characteristics was possible because the LUCAS cohort was collected as a prospective observational trial. No difference in the DELFI score was observed when comparing non-cancer individuals with or without benign lung lesions (median DELFI score 0.16 vs 0.21, p=0.99, Wilcoxon rank sum test,
The relationship between DELFI scores and cancer stage and histology was examined next. While the DELFI score for non-cancer individuals was low (median DELFI scores of 0.16 or 0.21 for those without a biopsy or with benign lesions, respectively), patients with cancer had significantly higher median DELFI scores (DELFI scores for stage I=0.35, stage II=0.75, stage III=0.90, and stage IV=0.99)(p<0.01 for Stages I, II, III, or IV, Wilcoxon rank sum test) (
To externally validate the predictive performance of DELFI in an independent group of individuals with or without lung cancer, a single fragmentation-based model was first obtained using the non-cancer individuals and patients with baseline lung cancer in the LUCAS study, and the DELFI score cutoffs required to achieve specificities ranging from 70% to 85% were determined. Next, in an independent validation cohort comprising individuals without cancer (n=385) or with predominantly early stage cancer (n=46), the fixed model from LUCAS was used to compute DELFI scores. Using the previously established cutoffs, the cancer status for individuals in the validation set was predicted according to whether their DELFI score was above or below the cutoff. The sensitivities and specificities of this model in the validation cohort were similar to those observed in the LUCAS cohort at different stages of the disease and among different histologic subtypes (
Combining DELFI Profiles with Multimodal Analyses
To evaluate multimodal approaches for cancer detection in combination with the multi-feature cfDNA analyses, the serum levels of carcinoembryonic antigen (CEA), a secreted protein that has been proposed as a lung cancer biomarker12,15,32,33, were first assessed. Patients with lung cancer had higher CEA levels compared to patients without cancer, with more than 20% of stage I-III and the majority of stage IV cancer patients detected at levels >7.5 ng/ml, while only ~4% of non-cancer patients fell above this threshold12,15,34 (p<0.001)(
DELFI Analyses of Lung Cancer Progression and Outcome
To examine the relationship between fragmentation profiles and lung tumor progression, it was assessed whether the size of the lung cancer lesion or other clinical or radiological findings were related to aberrant fragmentation profiles. Although previous studies suggest small tumors (e.g. ~1 cm³) may be missed by mutation-based approaches given the limited number of ctDNA molecules at specific locations and limits of detection with these methods of ~0.1%20, genome-wide approaches may allow for more sensitive detection of such changes. As the DELFI approach interrogates ~40 million fragments, it was expected that ~40,000 fragments across the genome would be tumor derived in a patient with a small tumor having a 0.1% ctDNA contribution, thereby increasing the chances of detection. Interestingly, in the LUCAS cohort, eight of the nine tumors less than two cm in size (T1a) had DELFI scores higher than the median score of the non-cancer population (median DELFI score of 0.40 vs. 0.16) (
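The expected number of tumor-derived fragments stated above follows from a single multiplication, which may be sketched as below. The function name is hypothetical and is used only for illustration; the fragment count and ctDNA fraction are the approximate values given in the text.

```python
def expected_tumor_fragments(total_fragments, ctdna_fraction):
    """Expected count of tumor-derived fragments, assuming each sequenced
    fragment is tumor derived with probability `ctdna_fraction`."""
    return round(total_fragments * ctdna_fraction)

total_fragments = 40_000_000   # ~40 million fragments interrogated
ctdna_fraction = 0.001         # 0.1% ctDNA contribution

print(expected_tumor_fragments(total_fragments, ctdna_fraction))  # 40000
```

Because the count scales linearly with the number of fragments interrogated, a genome-wide analysis can accumulate many tumor-derived fragments even when the tumor contribution at any single location is very small.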
The long clinical follow-up of the LUCAS cohort (7-8 years) enabled an analysis of the association between DELFI scores and survival. These analyses revealed that a DELFI score greater than 0.5 was associated with a decreased overall survival compared to DELFI scores below 0.5 (P<0.001,
Given the important differences in biologic characteristics and clinical management of SCLC and NSCLC, it was evaluated whether genome-wide fragmentation profiles could be used to noninvasively distinguish between these cancer types. Publicly available TCGA RNA-seq data from lung cancer subtypes were used to identify transcription factors with the highest differential expression between SCLC (n=79) and NSCLC (n=1046) or white blood cell (n=755) samples, and ASCL1 (Achaete-Scute Family basic helix-loop-helix Transcription Factor 1) was identified as the most highly differentially expressed gene (>960 fold compared to NSCLC and WBC) (
Analyses of patients with a previous history of cancer who were in clinical remission at the time of the DELFI baseline assessment identified 25 patients, five who recurred, and four who ultimately died from this disease. These included three patients with head and neck cancers, one with colon cancer, and one with malignant melanoma. Patients with subsequent recurrence had significantly higher DELFI scores than those individuals without recurrence (median DELFI scores 0.65 vs 0.19, p=0.005) (
Additionally, the longitudinal clinical follow-up available in the LUCAS cohort enabled an analysis of fragmentation profiles in individuals who were deemed cancer-free at baseline but who developed a new cancer after baseline assessment. Of the 17 study subjects with a subsequent cancer diagnosis within two years (excluding localized skin tumors), four patients had DELFI scores greater than 0.5 at the time of enrollment, ranging from 0.5 to 1.0 within 33 to 481 days after enrollment. The malignancies identified comprised one case of NSCLC, as well as three non-pulmonary malignancies including chronic lymphocytic leukemia (CLL) and two B cell lymphomas. These data provide evidence that elevated DELFI scores may identify the emergence of cancers that were clinically undetected.
To evaluate the theoretical impact of a non-invasive molecular blood test on lung cancer detection, the performance of the DELFI score or the multimodal DELFImulti score, followed by standard diagnostic CT imaging, was examined in the LUCAS cohort. This allows examination of the scenario in which high-risk individuals first have a blood draw and, depending on the results of the cfDNA analyses, either proceed to an LDCT if the DELFI score is positive or forgo an LDCT if the score is negative (
To examine how the approach herein would perform for the overall detection of individuals with lung cancer at a population scale, the DELFI model was evaluated in a theoretical population of 100,000 high-risk individuals using Monte Carlo simulations. Using the estimated sensitivities and specificities of LDCT alone or with DELFI as a prescreen in this hypothetical population (
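A simulation of this general kind may be sketched as follows. This is a minimal sketch under stated assumptions and is not the simulation performed in the study: the function name is hypothetical, the blood test and LDCT are assumed to be conditionally independent given disease status, and the prevalence, sensitivity, and specificity values are placeholders chosen only for illustration, not estimates from the LUCAS cohort.

```python
import numpy as np

def simulate_screening(n, prevalence, ldct_sens, ldct_spec,
                       pre_sens=None, pre_spec=None, seed=0):
    """Monte Carlo sketch of LDCT alone, or LDCT restricted to individuals
    with a positive blood-based prescreen (when pre_sens/pre_spec are given)."""
    rng = np.random.default_rng(seed)
    has_cancer = rng.random(n) < prevalence

    if pre_sens is None:
        # LDCT alone: every individual is scanned.
        scanned = np.ones(n, dtype=bool)
    else:
        # Prescreen: only prescreen-positive individuals go on to LDCT.
        p_pre_pos = np.where(has_cancer, pre_sens, 1.0 - pre_spec)
        scanned = rng.random(n) < p_pre_pos

    p_ldct_pos = np.where(has_cancer, ldct_sens, 1.0 - ldct_spec)
    ldct_pos = scanned & (rng.random(n) < p_ldct_pos)

    tp = int(np.sum(ldct_pos & has_cancer))
    fp = int(np.sum(ldct_pos & ~has_cancer))
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"scans": int(scanned.sum()), "true_positives": tp,
            "false_positives": fp, "ppv": ppv}

# Placeholder parameters (hypothetical, for illustration only).
N = 100_000
alone = simulate_screening(N, prevalence=0.01, ldct_sens=0.90, ldct_spec=0.75)
with_prescreen = simulate_screening(N, prevalence=0.01, ldct_sens=0.90,
                                    ldct_spec=0.75, pre_sens=0.90, pre_spec=0.80)
print("LDCT alone:    ", alone)
print("With prescreen:", with_prescreen)
```

Under these placeholder values, the prescreen pathway performs fewer scans and yields fewer false positive scans, at the cost of missing cancers that the blood test does not flag; the actual trade-off depends on the estimated sensitivities and specificities and on how many individuals undergo screening under each pathway.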
Overall, an improved DELFI approach is described for genome-wide fragmentation analyses for detection of lung cancer. It is proposed that facile and scalable analyses of the cfDNA fragmentome could be used to prescreen high-risk populations for lung cancer to increase accessibility of lung cancer detection and decrease unnecessary follow-up imaging procedures and invasive biopsies. Through the analysis of the LUCAS cohort, it was demonstrated that the DELFI approach can detect lung cancer across all stages and histologic subtypes compared to non-cancer individuals with or without benign lung nodules. The validation of the fixed DELFI model from the LUCAS cohort in an independent validation cohort supports the generalizability of the approach. Similar to observations with targeted sequencing approaches16,22,39-43, the relationship between DELFI scores and tumor progression and long-term mortality provides evidence that the blood-based fragmentation analyses may identify occult disease not observed by imaging, or more accurately identify the aggressiveness of the disease. The distinction between NSCLC and SCLC may allow for noninvasive characterization and treatment of lung cancer patients when tissues are not available. The identification by DELFI of patients who were found to have cancer only months later through standard diagnostic methods shows the utility of the approach for cancer detection, detection of recurrent disease, and the potential for detection of cancers at earlier stages (“stage shifting”) through lung cancer screening. The possibility of combining genome-wide multi-feature fragmentation profile analyses with a standard protein marker and clinical characteristics provides an avenue for high complexity multimodal analyses that can further increase the sensitivity of the approach.
Despite the publication of the NLST trial almost a decade ago5, the impact of LDCT in reducing lung cancer morbidity and mortality has been limited. Challenges for this approach have included insufficient imaging facilities and infrastructure that can screen large numbers of patients, the complexity of the medical workup that requires frequent visits and shared decision making, and repetitive radiation exposure from annual screening44. Additionally, imaging studies detect radiographic abnormalities, not cancer, and result in biopsy-identified cancer diagnoses in only a small minority of positive scan findings, while the majority of false positive findings may drive invasive diagnostic procedures as well as ongoing patient anxiety during months or years of follow-up. Finally, while screening has been recognized as an important step for early detection of lung cancer in high-risk individuals, a significant percentage of lung cancer occurs in lower risk individuals45 and current USPSTF recommendations do not recommend LDCT screening for these patients due to the imbalance of harms and benefits.
The analyzed cohorts represent real-world, prospective populations and the collection and processing of all samples were performed in a systematic fashion, ensuring homogeneity of pre-analytical characteristics and careful control of experimental and analytical variables. The potential improvement of the positive predictive value in the combined LDCT/DELFI approach suggests that many fewer unnecessary procedures would be performed in individuals with positive results. Additionally, the DELFI score appears to not be affected by non-cancer conditions, which have confounded other potential biomarkers for lung cancer detection. The observations that scalable and cost-effective non-invasive cfDNA fragmentation analyses can discriminate lung cancer patients from non-cancer individuals may ultimately provide an opportunity to evaluate not only high-risk individuals but the general population for lung cancer.
While the invention has been described in conjunction with the detailed description thereof, the foregoing description is intended to illustrate and not limit the scope of the invention, which is defined by the scope of the appended claims. Other aspects, advantages, and modifications are within the scope of the following claims.
The patent and scientific literature referred to herein establishes the knowledge that is available to those with skill in the art. All United States patents and published or unpublished United States patent applications cited herein are incorporated by reference. All published foreign patents and patent applications cited herein are hereby incorporated by reference. All other published references, documents, manuscripts and scientific literature cited herein are hereby incorporated by reference.
The present application claims the benefit of priority of 1) U.S. provisional application No. 63/128,776, filed Dec. 21, 2020 and 2) U.S. provisional application No. 63/197,301, filed Jun. 4, 2021, each of which applications is incorporated by reference herein in their entirety.
This invention was made with government support under grants CA121113 and CA233259 awarded by the National Institutes of Health. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US21/64613 | 12/21/2021 | WO |
Number | Date | Country | |
---|---|---|---|
63197301 | Jun 2021 | US | |
63128776 | Dec 2020 | US |