The present invention relates to the diagnosis of lung tumors. It provides methods suitable for diagnosing lung tumors both on the basis of surgical samples and lung biopsies (here, e.g., with the aid of DNA microarrays) and on the basis of liquid biopsies. In the case of liquid biopsies, cell-free DNA (cfDNA) is used. In this context, both particularly suitable analysis methods and particularly suitable sets of methylation markers are described. Further objects of the invention are means suitable for diagnosing lung cancer by examining the methylation of a set of methylation markers, e.g., in cfDNA from liquid biopsy samples of patients, wherein the means comprise oligonucleotides which can hybridize to DNA comprising the methylation markers, as well as the use of said methods and means for diagnosing, e.g., for the determination, subtyping and prognostic characterization of lung tumors.
Lung cancer is the second most common type of cancer in men and women worldwide. In Germany, approx. 52,500 new cases are registered annually. The mean age of onset of disease is 70 years for men and 69 years for women. A distinction is made between small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLCs are distinctly more common and occur in 85% of the affected patients. Furthermore, several subentities are distinguished in the case of NSCLCs, of which the most common are adenocarcinoma and squamous cell carcinoma.
Because the symptoms of the disease usually appear very late, the prognosis is poor: the 5-year survival rate is 15%.
Like most other tumors, lung carcinomas exhibit high genomic heterogeneity. For example, mutations within KRAS, EGFR, BRAF, MEK1, MET, HER2, ALK, ROS1, RET, FGFR1, DDR2, PTEN, LKB1, RB1, CDKN2A or TP53 genes can induce the development of a primary lung carcinoma. In addition, so-called passenger mutations accumulate during the course of tumor evolution, which can lead to various subclones. This fact renders the development of a reliable early-detection test based only on molecular-genetic mutation analyses very difficult, which becomes apparent from many examples in the literature.
For example, Uchida et al. carried out a lung carcinoma screening based on typical mutations of the EGFR gene. The average sensitivity of this test was only 54.4% and dropped to 22.2% in the case of early stages IA-IIIA (Uchida et al. [2015] Clin. Chem. 61: 1191-1196). Couraud et al. developed an NGS-based test, in which the best-known mutations within the EGFR, BRAF, KRAS, HER2 and PIK3CA genes were analyzed in plasma. The sensitivity of said test was 58%. Here too, the detection of tumors in early stages posed a problem (Couraud et al. [2014] Clin. Cancer Res. 20: 4613-4624). In 2014, Newman et al. developed CAPP-Seq.
This was an optimized NGS protocol with an associated bioinformatic evaluation pipeline. In the case of CAPP-Seq, the best-known NSCLC mutations in plasma are sequenced and analyzed, which allowed for identifying 100% of stage II to IV lung cancer patients. However, the identification of tumors in stage I again posed a problem here, and the corresponding sensitivity was only 50% (Newman et al. [2014] Nat. Methods 20: 548-554). These examples clearly show the problem in developing a reliable early-detection test for lung carcinoma that is based only on genomic analyses.
In addition to mutations, epimutations also play a decisive role during tumor evolution. For example, promoters within certain tumor suppressor genes become hypermethylated, which, in turn, results in their transcriptional repression. This phenomenon is accompanied by the overexpression of DNA methyltransferases. Promoter hypermethylation has been described particularly frequently in the literature within the P16INK4A, RASSF1A, APC, RARB, CDH1, CDH13, DAPK, FHIT and MGMT genes (Langevin et al. [2015] Transl. Res. 165: 74-90).
The genome-wide hypomethylation of NSCLC is associated with genomic instability. Targeted hypomethylation of genes has so far been identified only in the case of MAGEA3/6, TKTL1, BORIS, DDR1, YWHAZ and TMSB10 (inter alia, Newman et al. [2014] Nat. Methods 20: 548-554).
Furthermore, malignant lung tumors frequently exhibit altered histone acetylation at positions H4K5, H4K8, H4K12 and H4K16. The global proportion of H4K20me3, too, is lower in NSCLC than in healthy lung tissue (Newman et al. [2014] Nat. Methods 20: 548-554). In addition, aberrant ncRNA expression can occur, such as, e.g., MIR196A, MIR200B, MALAT1 and HOTAIR.
According to national and international recommendations, the affected patients are currently initially subjected to a comprehensive physical examination in the event of a suspected diagnosis. Subsequently, the thorax is examined by imaging methods such as, e.g., radiography or computed tomography (CT). If tumors are detected in this process, subsequent bronchoscopies are recommended, during which the lungs are thoroughly analyzed endoscopically and biopsies of the tumors are taken. Said biopsies are, then, subjected to histological, immunohistochemical and molecular-genetic analyses.
During the histological examinations, it is determined whether the tumors are malignant. If this is the case, their entity is ascertained. To identify the optimal therapy, molecular-genetic and imaging methods are additionally considered. Due to the radiation exposure and invasiveness, especially the imaging and endoscopic methods can be stressful here for the affected patients.
The detection limit of the radiological methods is a tumor size of 7 to 10 mm, which already corresponds to cell clusters of roughly one billion tumor cells. An alternative, less invasive method is based on liquid biopsies, by means of which tumors can be detected much earlier, from a size of ca. 50 million cells.
In the case of liquid biopsies, a few milliliters of blood are collected from the patient. Circulating cell-free DNA (cfDNA) can then be isolated from the blood plasma or blood serum. In the human body, cfDNA is formed during apoptotic and necrotic processes. This involves the cleavage of cellular, genomic DNA (gDNA) by DNases into fragments of ca. 167 bp in length and their release into the bloodstream.
In the case of patients suffering from malignant diseases, the total amount of cfDNA additionally contains tumor DNA. The amount of cfDNA can vary greatly depending on the entity or stage of the disease. However, it contains diagnostically, therapeutically and prognostically relevant information.
In addition to genetic mutations of a tumor, epimutations can also be analyzed. In this context, DNA methylation is of particular interest. The DNA methylation pattern is tissue-specific and already changes in early phases of tumor evolution. Furthermore, a study of the GNAS1 locus made clear that cfDNA methylation in the blood remains stable. It is neither modified nor distorted and is thus suitable as a biomarker in clinical diagnostics (Puszyk et al. [2009] Clin. Chim. Acta 400: 107-110).
The diagnostic potential of DNA methylation has already been made clear by several studies. For instance, a SOX17 study in stomach carcinoma showed that the overall survival of the patient cohorts correlated with the detected amount of methylated SOX17 cfDNA (Balgkouranidou et al. [2013] Clin. Chem. Lab. Med. 51: 1505-1510). A study with female patients suffering from breast carcinoma showed significant hypermethylation of the CST6 gene (Chimonidou et al. [2013] Clin. Biochem. 46: 235-240). Liggett et al. were able to distinguish between pancreatic carcinoma and its precursor, chronic pancreatitis, based on the DNA methylation pattern (Liggett et al. [2010] Cancer 116: 1674-1680).
Alterations in the DNA methylation pattern have also been described in NSCLC by several working groups. For example, Balgkouranidou et al. could detect significant hypermethylation of the BRMS1 gene in patients with bronchial carcinoma (Balgkouranidou et al. [2014] Brit. J. Cancer 110: 2054-2062). In 2016, Marwitz et al. detected DNA hypomethylation within the CTLA4 and PDCD1 genes. Said genes were overexpressed at the transcriptome level. Since these are therapeutically important checkpoint regulators, this work is of great therapeutic relevance (Marwitz et al. [2017] Clin. Epigenet. 9: 51).
The diagnostic potential of DNA methylation also becomes clear from the example of the “Epi proLung” assay (“Epigenomics AG”, Germany). In this case, the cfDNA methylation pattern of the SHOX2 and PTGER4 genes is analyzed. At a specificity of 90%, the sensitivity is 67% (Weiss et al. [2017] J. Thorac. Oncol. 12: 77-84). Therefore, the sensitivity of the “Epi proLung” test is insufficient for reliable lung cancer screening. As yet, there are no further liquid biopsy-based methods which allow reliable, preventive early detection of lung cancer.
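The two performance figures quoted above can be made concrete with a minimal sketch of how sensitivity and specificity are computed from confusion-matrix counts. The counts used here are hypothetical and were chosen only so that the result reproduces the quoted 67% and 90%; they are not data from Weiss et al., and the function names are illustrative.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of diseased subjects that the test correctly reports as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of healthy subjects that the test correctly reports as negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical cohort (illustrative only): 100 patients and 100 controls.
tp, fn = 67, 33   # patients detected / patients missed
tn, fp = 90, 10   # controls correctly negative / controls falsely positive

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
```

In such a hypothetical cohort, roughly one in three patients would be missed, which illustrates why a sensitivity of 67% is regarded as insufficient for screening.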
Against this background, the inventors addressed the problem of providing a more reliable method for diagnosing lung cancer. This problem is solved by the invention, especially by the subject matter of the claims.
One aspect of the invention is a method for diagnosing lung cancer, wherein the methylation of a set of methylation markers in a sample of a patient is determined, wherein, e.g., cfDNA from a liquid biopsy can be examined. Alternatively, the sample can also be a tissue sample, e.g., a solid tissue sample from a tumor or from a tissue in which a tumor is possibly present. In particular, the tissue sample can originate from a biopsy or surgical material of lung tissue. Pleural fluid can be examined, too. The method according to the invention is distinguished by the fact that, owing to the selection of markers, it is particularly well suited to being used for examination of tissue samples taken during surgery, for examination of lung biopsy tissue and for examination of cfDNA from a liquid biopsy. In the context of the invention, surgeries in which tissue is collected as a sample will usually be surgeries for removal of a diagnosed lung tumor. Even then, however, questions will still arise, which the method according to the invention can answer, for instance about the entity and/or prognosis of the tumor or in relation to the demarcation between tumor tissue and adjacent normal tissue.
The invention provides a method for diagnosing lung cancer, wherein the methylation of a set of methylation markers, e.g., in cfDNA from a liquid biopsy sample of a patient, is determined, wherein, optionally, an alignment against a reference genome using the Segemehl algorithm is carried out.
The invention further provides a method for diagnosing lung cancer, wherein the methylation of a set of methylation markers, e.g., in cfDNA from a liquid biopsy sample of a patient, is determined, wherein, optionally, the methylation of methylation markers in the genes SERPINB5, DOCK10, PCDHB2, HIF3A, FGD5, RCAN2, HOXD12, OCA2, SLC22A20, FADL-1, NRXN1, ACOXL, FAM53A, UBE3D and AUTS2 is determined.
For minimally invasive diagnostics of lung tumors (lung carcinomas), according to the invention, use is made of, e.g., the circulating cell-free DNA (cfDNA) from liquid biopsies, e.g., from plasma, blood or serum, preferably from plasma. If a patient is suffering from a malignant tumor disease, the total amount of circulating DNA also contains the tumor DNA, which contains all therapeutically and prognostically relevant information about the genetic and epigenetic characteristics of the tumor. The invention provides both preferred methods for diagnosing lung cancer on this basis and preferred sets of methylation markers.
In the context of the invention, it was shown that the methylation signatures in solid tumors, e.g., in samples from surgeries or biopsies, partly differ from the signatures from cfDNA from liquid biopsies. This can explain why the abovementioned “Epi proLung” study, in which the cfDNA methylation profile within the SHOX2 and PTGER4 genes was analyzed, exhibited, at a specificity of 90%, only a sensitivity of 67% (Weiss et al. [2017] J. Thorac. Oncol. 12: 77-84). The SHOX2 and PTGER4 biomarkers used originate from analyses of primary tumor tissue (Murn et al. [2008] J. Exp. Med. 205: 3091-3103; and Schneider et al. [2011] BMC Cancer 11: 102). However, the present invention clearly shows (see section 2.1.3) that the DNA methylation patterns correlate only to a limited extent between the cfDNA from the plasma and the gDNA from a primary tumor. Indeed, the total amount of cfDNA contains not only DNA derived from the lung or a tumor, but also DNA from further tissues and organs.
This means that the strongly aberrant methylated DNA regions in the primary tumor tissue do not necessarily exhibit differential methylation in the plasma. Therefore, it is not sufficient for the development of a noninvasive, cfDNA-based early-detection test to use known biomarkers from the primary tumors. Instead, it is necessary to identify novel cfDNA-specific, strong and unambiguous methylation signatures in the plasma of the affected patients. However, cfDNA-specific methylation signatures are in return also not necessarily suitable for diagnosis and examination of tissue samples. Therefore, the goal was - in distinction to the approaches known in the state of the art - to determine universal methylation signatures, by means of which very different (also complex) patient samples (also with greatly varying content of tumor cells) can be examined robustly and reliably. This was achieved using the present invention. According to the invention, it is advantageous that the identified markers provide good results both with tissue samples, e.g., solid tissue samples from tumor tissue, and with liquid biopsies and are thus suitable for diagnosing lung cancer from various types of samples.
To identify a set of methylation markers according to the invention that comprises particularly informative differentially methylated regions, multiple steps were carried out in the context of the invention, which are described in detail in the Example section. First, DNA methylation signatures were examined in 40 malignant lung tumors and their corresponding controls. DNA methylation signatures were then analyzed in the blood plasma of nine patients. Of these, five patients were suffering from adenocarcinoma of the lungs and four from squamous cell carcinoma of the lung. By contrast, the remaining patients were free of malignant diseases and formed the control cohorts. Finally, additional data sets from multiple studies that have been made available were evaluated, which made it possible to identify further tumor-specific and prognostic CpG loci. The set of methylation markers synthesized on this basis, also referred to as plasma panel (see Table 1), was subsequently validated in the context of a pilot study. Said set of methylation markers comprises a plurality of regions which, e.g., are differentially methylated in cfDNA and, surprisingly, allow for a specific statement about the presence of a tumor, the tumor entity, the tumor stage and/or the prognosis.
In one embodiment, the invention therefore relates to a method for diagnosing lung cancer, in which the methylation of a set of methylation markers in a sample of the patient is determined, wherein the set of methylation markers is selected from the group consisting of the regions listed in Tables 1a, 1b and 1c and comprises at least 60 regions, preferably at least 64 regions, more preferably at least 340 or at least 350 regions, most preferably at least 630 regions. For example, methylation markers can be determined to determine the presence of a tumor.
The invention also relates to a method for diagnosing lung cancer, in which the methylation of a set of methylation markers in a sample of the patient is determined, wherein the set of methylation markers is selected from the group consisting of the regions listed in Tables 1a, 1b and 1c and comprises at least 134 regions, preferably 138 regions, more preferably at least 240 regions, most preferably at least 247 regions. For example, methylation markers can be determined to determine the entity of a tumor.
According to the invention, the set of methylation markers can comprise at least 194 regions, preferably at least 600 regions, optionally all 630 regions. For example, at least 60, preferably at least 64 methylation markers can be determined to determine the presence of a tumor, e.g., methylation markers from Table 1a, and at least 134, preferably 138 methylation markers can be determined to determine the entity of the tumor, e.g., methylation markers from Table 1b. The more methylation markers are determined, the more accurate the analysis. Therefore, at least 150, preferably at least 340 or even 350 methylation markers can also be determined to determine the presence of a tumor, e.g., methylation markers from Table 1a, and at least 240 or even 247 methylation markers can be determined to determine the entity of the tumor, e.g., methylation markers from Table 1b. Optionally, at least 15, preferably at least 30 or even 33 methylation markers from Table 1c can be additionally determined to determine the prognosis.
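The marker counts named above can be summarized in a small lookup, sketched here in Python. Only the numbers are taken from the text; the dictionary layout, the names `MINIMUM`, `LARGEST` and `supported_questions`, and the reading of the largest count per table as the table size are assumptions made for illustration.

```python
# Marker counts quoted in the text. The layout and all names are
# illustrative assumptions, not part of the described method.
MINIMUM = {"1a": 60, "1b": 134}              # presence, entity
LARGEST = {"1a": 350, "1b": 247, "1c": 33}   # largest count named per table
PROGNOSIS_MINIMUM = 15                       # optional markers from Table 1c


def supported_questions(selected):
    """Return the questions a panel can address, given how many regions
    were selected from each table (keys '1a', '1b', '1c')."""
    questions = []
    if selected.get("1a", 0) >= MINIMUM["1a"]:
        questions.append("presence")
    if selected.get("1b", 0) >= MINIMUM["1b"]:
        questions.append("entity")
    if selected.get("1c", 0) >= PROGNOSIS_MINIMUM:
        questions.append("prognosis")
    return questions


print(sum(MINIMUM.values()))   # 60 + 134
print(sum(LARGEST.values()))   # 350 + 247 + 33
print(supported_questions({"1a": 64, "1b": 138}))
```

The two sums reproduce the totals of 194 and 630 regions named in the text, which suggests that these totals are simply the combined counts for the individual tables.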
In one embodiment, the invention therefore relates to a method for diagnosing lung cancer, in which the methylation of a set of methylation markers in a sample of a patient, e.g., in cfDNA from a liquid biopsy sample of a patient, is determined, wherein the set of methylation markers comprises at least 60 regions selected from the group consisting of:
The aforementioned methylation markers are the markers mentioned in Table 1a, which were identified only in cfDNA. In this analysis, the presence of a tumor is preferably examined, wherein the set of methylation markers optionally comprises all the regions of the group.
In this context, the set of methylation markers can comprise at least 340 regions selected from the group consisting of the regions listed in Table 1a, wherein the set of methylation markers preferably comprises all the regions listed in Table 1a.
In one embodiment of the abovementioned methods, the set of methylation markers comprises at least 134 regions selected from the group consisting of
The aforementioned methylation markers are the markers mentioned in Table 1b, which were identified only in cfDNA. In this analysis, the entity of a tumor is preferably examined, wherein, in particular, a distinction can be made between adenocarcinoma and squamous cell carcinoma. In this context, the set of methylation markers can comprise all regions of the group.
In this analysis, the set of methylation markers can also comprise at least 240 regions, wherein the group consists of the regions listed in Table 1b. Preferably, the set of methylation markers comprises all regions of the group listed in Table 1b.
Since it has been shown that all the regions defined in Tables 1a and 1b are differentially methylated in the samples examined, it is advantageous to analyze all regions defined in Tables 1a and 1b, especially if both the presence and the entity of a potential tumor are to be analyzed.
The validity of the analysis is greatest if the set of methylation markers comprises at least 620 regions from a group consisting of all regions listed in Table 1, especially if the prognosis is further determined, preferably if the set of methylation markers comprises all regions of the group.
During further analysis of the data and verification on the basis of cfDNA from patients, a second set of methylation markers having various subgroups was identified in the context of the invention, by means of which different questions can be answered (see Tables 2-4). The corresponding methylation markers are defined differentially methylated positions which lie in the regions mentioned in Table 1. The methylation markers mentioned in Tables 2-4 thus represent suitable subgroups for examination of the methylation markers contained in the plasma panel.
Thus, in the context of the invention, either differentially methylated regions, e.g., the regions defined in Tables 1a, 1b and/or 1c, or differentially methylated positions can serve as methylation markers. In this regard, the analysis of entire regions leads to more reliable results, since specific positions need not necessarily have the same informative value in the case of particular patients. In return, an analysis of specific positions is possible with less effort, e.g., via an array, and is therefore favorable if a cost-effective diagnosis is to be made. The choice is therefore based on a consideration of the reliability required in the particular case and the acceptable effort. Of course, both types of methylation markers can also be used simultaneously for diagnosis. Furthermore, the amount of sample available also plays a role, since especially tissue samples from surgeries contain amounts of DNA sufficient for carrying out an analysis of individual methylated positions via an array.
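The trade-off between regions and positions can be illustrated with a minimal sketch. It assumes, purely for illustration, that a region is summarized by the mean methylation of its CpG positions; the text does not prescribe this aggregation, and the values are hypothetical methylation fractions, not measured data.

```python
from statistics import mean

# Hypothetical methylation fractions (0 = unmethylated, 1 = fully methylated)
# for five CpG positions of one region; not data from the document.
tumor = [0.82, 0.35, 0.78, 0.80, 0.76]
control = [0.20, 0.30, 0.22, 0.18, 0.25]


def region_difference(case, reference):
    """Difference of the mean methylation over all positions of the region
    (mean aggregation is an assumption made for this sketch)."""
    return mean(case) - mean(reference)


def position_differences(case, reference):
    """Per-position methylation differences between case and reference."""
    return [c - r for c, r in zip(case, reference)]


print(round(region_difference(tumor, control), 3))
print([round(d, 2) for d in position_differences(tumor, control)])
```

In this hypothetical patient, the second position is almost uninformative on its own, while the region as a whole still shows a clear difference. A test relying on that single position would be less reliable for this patient than one evaluating the entire region.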
Particularly informative methylation markers identified in this context lie, in some cases, within the genes SERPINB5, DOCK10, PCDHB2, HIF3A, FGD5, RCAN2, HOXD12, OCA2, SLC22A20, FADL-1, NRXN1, ACOXL, FAM53A, UBE3D and AUTS2. Said genes had hitherto never been specifically described in connection with lung carcinomas or certain NSCLC entities.
The role of some of these genes in tumor evolution and prognosis is known in other cancer types. SERPINB5 is, e.g., a known oncogene (Lei et al. [2011] Oncol. Rep. 26: 1115-1120). HOX genes are aberrantly expressed in many cancer types (Bhatlekar et al. [2014] J. Mol. Med. 92: 811-823). Dysregulation of RCAN2 leads to proliferation of tumor cells (Niitsu et al. [2016] Oncogenesis 5: e253). In some studies, altered expression of DOCK10 resulted in the migration of melanoma cells (Gadea et al. [2008] Curr. Biol. 18: 1456-1465). Some OCA2 mutations are associated with an increased risk of melanoma, too (Hawkes et al. [2013] J. Dermatol. Sci. 69: 30-37). Furthermore, HIF3A and FGD5 are important angiogenesis regulators and therefore play a crucial role during tumor evolution (Jackson et al. [2010] Expert Opin. Therap. Targets 14: 1047-1057; and Kurogane et al. [2012] Arterioscler. Thromb. Vasc. Biol. 32: 988-996). The DNA methylation of some PCDHB2-CpG loci is associated with a poor prognosis of neuroblastoma patients (Abe et al. [2005] Cancer Res. 65: 828-834). Altered metabolism is, e.g., a characteristic of malignant tumors; in this case, the FADL-1 fatty acid transporter and some SLC transporters may play an important role (Lin et al. [2015] Nat. Rev. Drug Discov. 14: 543-560; and Black [1991] J. Bacteriol. 173: 435-442). UBE3D encodes a ubiquitin protein ligase. Several studies have shown that some ubiquitin protein ligases may play an important role during tumor evolution (see, inter alia, Lisztwan et al. [1999] Genes Dev. 13: 1822-1833). AUTS2 and NRXN1 are neural genes. Overexpression of AUTS2 has been demonstrated in liver metastases (Oksenberg & Ahituv [2013] Trends Genet. 29: 600-608). NRXN1 might be responsible for nicotine addiction (Ching et al. [2010] Am. J. Med. Genet. B. Neuropsychiatr. Genet. 153B: 937-947). Increased expression of ACOXL has already been described in prostate carcinomas (O'Hurley et al. [2015] PLoS One 10: e0133449).
Some studies describe FAM53A as a prognostic and therapeutic breast carcinoma marker (Fagerholm et al. [2017] Oncotarget 8: 18381-18398). However, the aforementioned studies do not allow any conclusions that a methylation in these genes, let alone in the positions mentioned in Tables 2-4, correlates with a lung cancer disease and can accordingly be used as a diagnostic marker for the presence of lung tumors or for the establishment of the entity or for the determination of the tumor stage.
Thus, the invention provides, for the first time, a method for diagnosing lung cancer, wherein the methylation of a set of methylation markers, e.g., in cfDNA from a liquid biopsy sample of a patient, is determined, wherein the methylation of methylation markers in the genes SERPINB5, DOCK10, PCDHB2, HIF3A, FGD5, RCAN2, HOXD12, OCA2, SLC22A20, FADL-1, NRXN1, ACOXL, FAM53A, UBE3D and AUTS2 is determined.
Preferably, said methylation markers comprise the methylation markers mentioned in Table 2, especially if the presence of a lung carcinoma is to be determined. Alternatively, especially if the entity of a lung carcinoma is to be determined, and especially if a distinction is to be made between adenocarcinoma and squamous cell carcinoma NSCLC types, the methylation markers comprise the methylation markers mentioned in Table 3. Preferably, both the methylation markers mentioned in Table 2 and those mentioned in Table 3 are determined to answer both questions. Optionally, the methylation markers mentioned in Table 4 can furthermore also be analyzed, which further allows conclusions to be drawn about the stage of the tumor.
Thus, the invention furthermore provides a method for diagnosing lung cancer, in which the methylation of a set of methylation markers, e.g., in cfDNA from a liquid biopsy sample of a patient, is determined, wherein the set of methylation markers comprises the following 10 positions (see also Table 2):
It has been demonstrated that said markers are particularly informative if the kNN algorithm is used for analysis. Using said markers, especially the presence of a tumor can be analyzed.
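The principle of a k-nearest-neighbor (kNN) classification on such a set of ten positions can be sketched as follows. This is a generic, minimal kNN with Euclidean distance and majority voting, not the implementation used in the context of the invention; the value of k, the function name `knn_predict` and all methylation values are assumptions and purely synthetic.

```python
from collections import Counter
from math import dist


def knn_predict(train, labels, sample, k=3):
    """Majority vote among the k training samples closest to `sample`
    (Euclidean distance over the methylation values)."""
    order = sorted(range(len(train)), key=lambda i: dist(train[i], sample))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Synthetic methylation fractions at ten positions per sample
# (hypothetical values, not measurements from Table 2).
train = [
    [0.80] * 10, [0.70] * 10, [0.75] * 10,   # tumor samples
    [0.20] * 10, [0.25] * 10, [0.30] * 10,   # control samples
]
labels = ["tumor", "tumor", "tumor", "control", "control", "control"]

print(knn_predict(train, labels, [0.65] * 10))
print(knn_predict(train, labels, [0.28] * 10))
```

A new sample is thus assigned to the class of the reference samples whose methylation pattern it most closely resembles, which requires a reference cohort with known diagnosis.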
Alternatively or additionally, the set of methylation markers can comprise the following 10 positions (see also Table 3):
It has been demonstrated that said markers are particularly informative if the RT algorithm is used for analysis. Using said markers, especially the entity of a tumor can be identified.
Optionally, especially if, furthermore, the stage of a tumor is to be identified (e.g., a distinction is to be made between an early (I+II) and a late (III+IV) stage of a lung carcinoma), the set of methylation markers can furthermore comprise all the positions listed in Table 4. In this case, the SVM algorithm can be used for analysis.
Regions which could not be validated using samples from early lung carcinoma stages could be signatures specific for metastases, for example. Therefore, said regions were used for calculation of the staging parameter, i.e., for calculation of the stage. So far, the staging parameter described in this work can distinguish the late stages of lung carcinoma from early stages with 80% accuracy. In general, the staging parameter should only be used as an indication. If the developed panel detects a lung carcinoma, it would be additionally advisable to generate therapeutically relevant information, e.g., with regard to the size or location of the tumor, by imaging methods, such as, e.g., MRI, CT or PET CT. It is thus also not essential to coanalyze the stage-based methylation markers in each case.
In the context of the invention, the lung cancer can be NSCLC or SCLC, preferably NSCLC. The NSCLC is preferably an adenocarcinoma or squamous cell carcinoma. It has been demonstrated that markers according to the invention can differentiate between these entities and are therefore suitable for differential diagnosis.
The diagnosis according to the invention makes it possible to make a statement about the presence of a tumor, the entity of a tumor (especially the differentiation between adenocarcinoma and squamous cell carcinoma), the tumor stage and/or the prognosis. Most important is the statement about the presence and entity of the tumor. Further statements can optionally also be made by means of supplementary methods, if the presence of a tumor has been established according to the invention. However, the method according to the invention optionally already allows a statement about the presence of a tumor, the entity of a tumor (especially the differentiation between adenocarcinoma and squamous cell carcinoma) and the tumor stage and preferably the prognosis. The term diagnosis thus includes differential diagnosis.
In contrast to hitherto known methods, the method according to the invention is also suitable for early detection of lung cancer, i.e., also for diagnosis in stage I or II. Advantageously, said diagnosis is furthermore also possible on the basis of a liquid biopsy sample, i.e., for example a blood sample, so that other tissue does not necessarily have to be removed from the patient. According to the invention, e.g., a liquid biopsy sample of a patient is therefore analyzed.
In addition, the method according to the invention can advantageously also be reliably carried out on the basis of lung biopsy tissue. In this case, it is also possible to carry out a “paired biopsy” and to therefore examine and compare in parallel tissue from lung biopsies of the presumably diseased lung and the presumably healthy lung of a patient. In the clinic, usually only the tumor or suspicious tissue is biopsied, with previously collected data sets of healthy tissues serving as a reference if necessary.
Preferably, the patient is a human being. In general, the word patient is used synonymously with subject. It may be a patient with symptoms suggesting that the patient has a lung tumor. However, it may also be a subject without symptoms. The subject or patient can be a patient at risk of a lung tumor. These include subjects who, because of certain risk factors and/or their lifestyle (e.g., smoking, use of e-cigarettes or other increased exposure to carcinogenic agents, symptoms), have an increased risk of a lung cancer disease and/or exhibit radiological abnormalities. The patient may also be a patient with a previously treated lung tumor, such as one who has undergone surgery, in which case tumor recurrence and/or metastasis may be investigated.
In general, the cfDNA can be extracted from a plurality of body fluids. For example, successful extraction from blood plasma and serum, pleural effusion or urine has already been described in the literature. According to the invention, the liquid biopsy sample can be blood, plasma, serum, sputum, bronchial fluid or pleural effusion. Preferably, it is derived from blood, e.g., serum or plasma, preferably plasma. Since pleural effusion only occurs in the course of the disease, this material is especially suitable for the detection of later stages. cfDNA extraction from plasma or serum is distinctly more rapid and cost-effective than from urine, which makes these materials more interesting for screening. Lastly, cfDNA stability is relevant, since cfDNA is more stable in plasma than in serum.
In one embodiment, the invention provides means which are suitable for diagnosing lung cancer using a method according to the invention by examination of the methylation of a set of methylation markers, e.g., in cfDNA from a liquid biopsy sample of a patient. The means are preferably also suitable for diagnosing lung cancer using a method according to the invention by examination of the methylation of a set of methylation markers in a different sample of a patient, especially a solid tissue sample from a tumor or a tissue in which a tumor is suspected or from a lung biopsy.
In this context, the means comprises oligonucleotides which can hybridize to DNA (e.g., cfDNA or DNA derived therefrom, e.g., by bisulfite conversion) which comprises or consists of methylation markers according to the invention. Methylation markers from the subgroups mentioned in the claims are preferred in this context. “Can hybridize” is to be understood to mean a specific hybridization, especially under stringent conditions, as outlined in the experimental section for instance.
Suitable oligonucleotides are, e.g., oligonucleotides which can hybridize to the regions mentioned in Table 1a, 1b and/or 1c, preferably in Table 1a, because they are complementary to these regions or a fragment thereof which comprises at least 20 nucleotides, e.g., when coupling to a solid support, preferably 60-352, optionally 100-190 or 135-157 nucleotides. The length depends, inter alia, on the base composition or sequence, on the hybridization temperature and on the technique selected. Since the DNA is double-stranded, the oligonucleotides can be complementary to the strand in the 5′-3′ direction or to the strand in the 3′-5′ direction, or to both. What is important is that the selected oligonucleotides cannot hybridize to regions other than those mentioned in the tables, which is likewise a prerequisite for a specific hybridization. Exemplary suitable oligonucleotides which can hybridize to the regions on Chromosome 1 mentioned in Tables 1a, 1b and 1c are listed in Table 5. A person skilled in the art is capable of selecting oligonucleotides suitable for other markers on the basis of the information disclosed herein about the markers.
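The complementarity requirement can be sketched in a few lines: an oligonucleotide that hybridizes to a given strand is the reverse complement of that strand. The example sequence, the function names and the use of the minimum length of 20 nucleotides as a simple check are illustrative assumptions. The sketch does not assess specificity, melting temperature or bisulfite-induced sequence changes.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def candidate_probe(target: str, min_len: int = 20) -> str:
    """Return an oligonucleotide complementary to `target`.

    `min_len` mirrors the minimum fragment length of 20 nucleotides named
    in the text; the check itself is an assumption of this sketch.
    """
    if len(target) < min_len:
        raise ValueError(f"target shorter than {min_len} nucleotides")
    return reverse_complement(target)


# Hypothetical 20-nucleotide target fragment (not a sequence from Table 5).
target = "ACGTTGCAAGGCTTACCGGA"
probe = candidate_probe(target)
print(probe)

# A probe against the opposite strand is the reverse complement of the
# first probe, i.e., it is identical to the original target sequence.
print(reverse_complement(probe) == target)
```

Whether such a candidate hybridizes only to the intended region would still have to be verified against the whole genome, as required above for a specific hybridization.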
Such oligonucleotides can optionally comprise further components, e.g., spacers or linker regions.
The oligonucleotides according to the invention can, e.g., be coupled to a solid support or are oligonucleotides which have been coupled to a solid support. Such coupling is, e.g., possible by means of adapters or tags. One option for this is coupling to biotin, which can bind (or has already bound) to streptavidin or avidin, which is coupled to the solid support.
The solid support can, e.g., be a gene chip, a globule or bead, e.g., a magnetic bead, or a column matrix. The support thus allows simple separation of the hybridized DNA. In the Example section, magnetic beads are described, which have been coupled via streptavidin-biotin binding to oligonucleotides which specifically hybridize to the regions mentioned in Table 1 and can be used as capture probes. Optionally, the means according to the invention comprise 638 oligonucleotides, e.g., capture probes, which can hybridize to all the methylation markers mentioned in Table 1.
Alternatively or additionally, the oligonucleotides according to the invention may also be a kit comprising PCR primers for amplification of regions which comprise the methylation markers or (especially in the case of regions from Table 1) consist thereof. PCR primers preferably have a length of approx. 12-40, optionally 15-25 nucleotides, which can hybridize to said regions. Such a kit can also comprise blocking oligonucleotides or detection probes, which, after bisulfite conversion, can specifically bind to previously methylated DNA or unmethylated DNA. Such oligonucleotides can, e.g., be used in PCR-based methods according to the invention.
An analysis by PCR is especially appropriate if only a limited number of markers is to be analyzed, i.e., for example the markers in the abovementioned genes. Preferably, this method analyzes the markers defined in Table 2, alternatively or additionally also the markers defined in Table 3, so that appropriate oligonucleotides can be selected accordingly.
Optionally, one or more primers suitable for multiplex PCR can be selected. Probes for detection are preferably labeled with suitable dyes.
The invention also provides a method in which the means according to the invention are used for diagnosis of lung cancer in a sample of a patient, wherein optionally cfDNA from a liquid biopsy sample of a patient (also referred to as subject) is examined. Owing to the selection of markers, other samples, e.g., from biopsies and bronchoscopies or from tissue samples collected during surgery, can, however, also be examined using the means according to the invention, especially using those which comprise markers from Tables 1a, 1b and/or 1c, preferably all the markers from Tables 1a and 1b and optionally also from Table 1c. Biopsies can also be collected from the outside, if necessary under imaging guidance.
If sequencing data are to be used, the bioinformatic evaluation pipeline poses a further problem. Conventional gDNA-WGBS libraries are usually aligned using the “Bismark” algorithm after processing. The results of the alignment can then be analyzed by numerous evaluation pipelines, with genome-wide DNA methylation signatures being extracted. The WGBS experiment on circulating DNA carried out in the exemplary embodiments was the first of its kind. It was found that the cfDNA libraries have a different complexity as well as fragment distribution compared to conventional gDNA libraries (see section 1.1.2.5). This might be the reason why the “Bismark” algorithm most commonly used in the prior art provided an unsatisfactory mapping efficiency of only 70%. It is for this reason that further algorithms were tested. The best results, with a mapping efficiency of at least 98%, were provided here by the “Segemehl” algorithm (see section 1.1.2.5).
Therefore, in the embodiment of the invention that is based on sequencing of bisulfite-converted cfDNA, the Segemehl algorithm in particular is used to align (i.e., to arrange) the sequencing information of the cfDNA with respect to a reference genome. The Segemehl algorithm is found under https://www.bioinf.uni-leipzig.de/Software/segemehl/ and is described in more detail in, e.g., Otto et al. (Otto et al. [2012] Bioinformatics 28: 1698-1704). Version 0.2.0 can be used, as in the example described below, but also another version, such as 0.3.4.
Another aspect of the invention provides a method according to the invention for diagnosing a lung tumor, comprising the following steps:
Means and methods for extracting genomic DNA, for extracting cfDNA from plasma, quantification, quality control (QC) and bisulfite conversion are known to a person skilled in the art from the state of the art and/or described herein.
The converted DNA, e.g., cfDNA, can be used for the production of the libraries. Library preparation is done in two steps. In the first step, e.g., as described in section 1.1.2.4, a WGBS library is produced from each sample, which contains information about the entire methylome or the cfDNA methylome of the corresponding patient. However, as only the specific, differentially methylated regions are sequenced and analyzed in the further course, these can be enriched from the entire methylome. This can be done as the second step on the basis of the whole-genome bisulfite sequencing library.
Various sets of methylation markers according to the invention can be used for enrichment, e.g., the markers identified in cfDNA for the first time in the context of the present work from Table 1a, all markers from Table 1a, alternatively or additionally the markers from Table 1b and/or 1c. It is, however, also possible to use only methylation markers for which particular significance has been found in the context of the classification, especially for the presence of a tumor (Table 2) or for the determination of the entity of the tumor (Table 3), but optionally also for the determination of the tumor stage (Table 4).
For enrichment, e.g., capture probes can be used. Said capture probes can cover the entire plasma panel or parts thereof (see section 1.2.1).
The enriched library can be subjected to a QC as well as quantified (see section 1.1.2.2). It is preferably sequenced, e.g., on the “MiSeq” (“Illumina”, USA) (see section 1.2.2). The sequencing data can, e.g., be stored in “FastQ” format and subsequently be analyzed (see, for example, section 1.2.3). Preferably, not the entire methylome is to be analyzed, but only defined methylation markers. Preferred methylation markers are, e.g., the 638 regions defined in Table 1 (plasma panel).
As mentioned, for the analysis, especially the Segemehl algorithm is used for alignment against a reference genome. Thereafter, the methylation patterns are calculated.
The format of the “Segemehl” output file differs from the typical “Bismark” format. Therefore, a suitable “Segemehl”-compatible analysis pipeline may be used. In this context, the “Bisulfite Analysis Toolkit” can be mentioned by way of example. This software of modular construction can be used on numerous computing clusters and expanded by further software as well as custom scripts. For the identification of the differentially methylated markers suitable for diagnosis of lung cancer, the analysis pipeline can be supplemented with custom bioinformatic scripts, e.g., the ones disclosed herein.
As an alternative to the diagnostic method via sequencing, it is also possible, on the basis of the results according to the invention, to carry out an analysis via PCR. This is especially relevant to smaller subgroups of the defined markers, e.g., if initially a sample of a patient is to be examined only for the presence of a tumor and/or the determination of the tumor entity. In this case, e.g., suitable primers can be used to amplify regions of the DNA, e.g., cfDNA, and to detect the positions mentioned in Table 2 and/or 3. This can be done from purified, bisulfite-converted DNA, e.g., by real-time PCR. Multiplex PCRs or parallel mixes can, however, also be used.
As internal control, e.g., beta-actin can be analyzed to check whether the amount of total DNA in the sample is sufficient. For this, e.g., cfDNA from a liquid biopsy, preferably from plasma, can be purified, bisulfite-converted and again purified, as described, e.g., in the exemplary embodiments. Blockers and detection probes can further be used for PCR that specifically recognize the bisulfite-converted unmethylated sequences within the regions and block their amplification so that the methylated sequences are preferentially amplified. Methylation-specific probes then exclusively detect methylated sequences which were amplified during the PCR.
Comparable methods are already described, e.g., for the Epi proLung Kit (Epigenomics AG, Berlin), and can be adapted for the methylation markers relevant according to the invention, e.g., from Tables 2 and 3. Evidently, it is also possible to additionally examine further methylation markers, e.g., from the plasma panel, with this method, e.g., more than 25 differentially methylated positions or more than 30 differentially methylated positions, preferably comprising the methylation markers mentioned in Tables 2 and 3 and/or lying within the regions mentioned in Table 1, preferably both.
The methylation patterns established in the sample of a patient (via sequencing-based methods or PCR-based methods), i.e., the results of the methylation marker analysis, can be correlated with the patterns known herein for tumors, optionally a certain entity and/or a certain stage, as specified, e.g., in the tables. According to the invention, this allows conclusions to be drawn about the presence, entity, stage and/or prognosis of a lung tumor, thus permitting a reliable advanced diagnosis.
According to the invention, this diagnosis can be used for selecting a therapy or for deciding on the commencement of a therapy in the event of a tumor being present.
In one embodiment, the invention thus also relates to a method for treating a lung tumor, comprising a diagnostic method according to the invention, wherein, in the event of a tumor being present, said tumor is treated. Advantageously, the entity of the tumor can also be established, allowing the selection of a therapy suitable for, e.g., an adenocarcinoma or a squamous cell carcinoma. A suitable therapy can, e.g., comprise the administration of suitable medicaments or combinations of medicaments and/or irradiation.
Alternatively, the diagnostic method can be used to carry out further diagnostic steps, such as the collection of a solid biopsy and/or imaging methods, in the event of a tumor being detected.
Another aspect of the invention provides for the use of a method according to the invention or of a means according to the invention for diagnosing lung cancer, wherein the diagnosis allows a statement about the presence of a tumor, about the entity of a tumor, about the tumor stage and/or about the prognosis, preferably about the presence and entity of the tumor, optionally about all at the same time.
In summary, it can be stated that, in the context of the present invention, it was possible for the first time to develop an NGS panel which is based on, inter alia, genome-wide cfDNA methylation signatures from plasma. Said plasma panel could be successfully validated using liquid biopsies of a patient cohort (n=12). However, the method according to the invention is explicitly distinguished by the fact that, due to the selection of markers, it is also particularly well suited for an examination of, e.g., tissue samples taken during surgery or lung biopsy tissue, in addition to the examination of cfDNA from a liquid biopsy. During the pilot study, the plasma panel distinguished malignant lung tumors with 100% accuracy as early as from stage I, identified the most common NSCLC subtypes and provided further information with regard to determining the stage of the lung tumors (staging).
The invention will be elucidated below by means of examples which are intended to illustrate, but not to limit, the invention. All the references cited in this application are fully incorporated herein by reference in their entirety.
To enable noninvasive lung cancer diagnostics, in the context of the invention, a suitable panel, i.e., a set of methylation markers, was developed for DNA methylation analysis in blood plasma. The set of methylation markers is therefore also referred to as the plasma panel. The development of the plasma panel was carried out in three independent approaches. In the first approach, it was checked whether DNA methylation is generally suitable as biomarker for lung cancer diagnostics (see section 1.1.1). For this purpose, 40 lung carcinomas and the corresponding controls thereof were analyzed using the “Illumina Infinium Human Methylation450K BeadChip” (HM 450K). The method identified distinct, tumor-specific DNA methylation signatures. Next, as described in section 1.1.1, the regions having the strongest differences in DNA methylation were ascertained and incorporated into the panel.
In the second approach, it was examined whether tumor-specific DNA methylation signatures can also be detected in the blood plasma of the patients affected (see section 1.1.2). For this, circulating cell-free DNA was extracted from the plasma of adenocarcinoma (n=5) and squamous cell carcinoma patients (n=4) and subsequently combined into 3 pools. Plasma of a tumor-free patient cohort (n=19) served as control. Detailed information about the patients is compiled in section 1.1.2. As a result of pooling, individual DNA methylation patterns were largely eliminated, and the general tumor- or lung-specific signatures were, by contrast, emphasized. Then, the cfDNA pools were subjected to whole-genome bisulfite sequencing (WGBS; see section 1.1.2.4). The method detected several thousand aberrantly methylated CpG loci which were not only tumor-specific, but also entity-specific. Of these, the most suitable regions were selected for differentiation for the plasma panel (see section 1.1.2.5.5). Since diagnosis according to the invention is preferably to be performed on the basis of liquid biopsies, the methylation markers identified here are of particular significance.
In the third approach, the plasma panel was supplemented by 59 tumor-specific and prognostically relevant CpG loci from further studies (see section 1.1.3).
The HM 450K data set contained information about the methylation status of 40 lung carcinomas (adenocarcinomas and squamous cell carcinomas) and their corresponding controls. The data set was evaluated using the “Qlucore Omics Explorer” software (version 3.2, “Qlucore”, Sweden) and yielded:
To ascertain the CpG loci having the strongest differences in DNA methylation, the two lists were first filtered according to differential methylation greater than 35% (avg. beta > 0.35) and annotated against the “HG19” reference genome using “Bedtools” (version 2.2.6, “The University of Utah”, USA). All CpG loci which were located within common SNPs (≥1% of the population) and were non-protein-coding were discarded. The remaining loci were incorporated into the final plasma panel (Table 1).
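The filtering step described above can be illustrated by a minimal sketch. It assumes that "differential methylation greater than 35%" refers to the absolute difference between the average beta values of tumor and control; the record type, the probe identifiers and the function name are hypothetical and do not reproduce the "Qlucore" or "Bedtools" workflow.

```python
from typing import List, NamedTuple


class CpG(NamedTuple):
    """One CpG locus with average beta values (0..1) for tumor and control."""
    probe_id: str        # hypothetical identifier, for illustration only
    beta_tumor: float
    beta_control: float


def filter_by_delta_beta(loci: List[CpG], threshold: float = 0.35) -> List[CpG]:
    """Keep loci whose absolute tumor-vs-control beta difference exceeds the threshold."""
    return [locus for locus in loci
            if abs(locus.beta_tumor - locus.beta_control) > threshold]


# Illustrative values only, not data from the study.
loci = [
    CpG("cg_hypothetical_1", 0.82, 0.30),  # difference 0.52: hypermethylated in tumor
    CpG("cg_hypothetical_2", 0.40, 0.35),  # difference 0.05: discarded
    CpG("cg_hypothetical_3", 0.10, 0.60),  # difference 0.50: hypomethylated in tumor
]
selected = filter_by_delta_beta(loci)
```

The absolute value is used so that both hypermethylated and hypomethylated loci are retained; the subsequent SNP and annotation filters are not covered by this sketch.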
According to the invention, circulating cell-free DNA is used for noninvasive diagnostics of solid tumors. If a patient is suffering from a malignant tumor disease, the total amount of circulating DNA also contains the tumor DNA, which contains all therapeutically and prognostically relevant information about the genetic and epigenetic characteristics of the tumor. Therefore, cfDNA must be isolated from blood or blood plasma. Since cfDNA can be extracted from blood plasma only in very low amounts, a method was chosen for this purpose that very specifically and efficiently enriches cfDNA without isolating further components of plasma.
For this, e.g., the “PME free-circulating DNA Extraction Kit” (“Analytik Jena”, Germany; see section 1.1.2.1) can be used. It contains a polymer which highly specifically complexes only short-stranded dsDNA fragments. The polymer-cfDNA complex is subsequently precipitated and purified. After purification, the complex can be dissociated. The released DNA is purified from the polymer and concentrated in further steps, e.g., by binding to a silica column. Other methods based, e.g., on the same or similar principles of action can be used, too. The resultant product is very clean and can also be used for sensitive NGS-based analysis methods such as, e.g., WGBS.
Blood plasma was prepared and shipped on dry ice. For this purpose, whole blood was centrifuged within 30 min of collection at 1500 g for 10 min. After centrifugation, the plasma supernatant was carefully pipetted off, aliquoted into “CryoPure” tubes (“Sarstedt AG&Co”, Germany) and immediately frozen at -80° C.
The frozen plasma samples were slowly thawed under lukewarm water and subsequently centrifuged at 4500 g for 10 min. The pellet was discarded, and the clear supernatant was transferred into a 10 mL tube and processed using the “PME free-circulating DNA Extraction Kit” according to the manufacturer’s instructions.
The cfDNA was quantified fluorometrically using the “Qubit dsDNA High Sensitivity Assay Kit” (“Thermo Fisher Scientific”, USA). For this purpose, 1 µL of each sample was mixed with 198 µL of “Qubit dsDNA HS Buffer” and 1 µL of “Qubit dsDNA HS Reagent”, incubated for 2 min and subsequently measured in the “Qubit 2.0” fluorometer (“Thermo Fisher Scientific”, USA). The “Qubit dsDNA HS Reagent” was a dye which generates a very weak fluorescent signal under normal conditions. However, in the presence of double-stranded DNA (dsDNA), it intercalates into the dsDNA, alters its structure and generates a strong fluorescent signal. Neither single-stranded DNA (ssDNA) nor RNA is bound. Therefore, the signal intensity exclusively correlates with the amount of dsDNA present in the sample.
The quality of the extracted cfDNA was analyzed with the aid of the “Agilent 2100 High Sensitivity DNA Kit” (“Agilent”, USA). The method was capillary gel electrophoresis. First, the “Gel-Dye Mix” had to be prepared. For this, 300 µL of the gel matrix were added to 15 µL of the dye concentrate, mixed and transferred to a “Spin Filter”. Centrifugation was carried out at 2240 g for 10 min. Next, the DNA chip was placed and equilibrated in the “Priming Station”. For this, 9 µL of the “Gel-Dye Mix” were pipetted into the well intended for the equilibration process. The plunger of the “Priming Station” was adjusted to one milliliter. After the “Priming Station” was firmly closed, the plunger was depressed for one minute. Lastly, the remaining wells of the chip were loaded according to the manufacturer’s instructions. The chip was incubated for 1 min and directly measured afterwards. During the incubation time, a fluorescent dye present in the “Gel-Dye Mix” intercalated between the bases of the dsDNA. The dsDNA fragments were subsequently drawn through the microscopically small capillaries of the “Agilent 2100 Bioanalyzer” (“Agilent”, USA) and, in the course of this, resolved and detected according to fragment size.
For whole-genome analysis of the DNA methylation pattern, e.g., by the HM 450K or WGBS, DNA is subjected to PCR-based whole-genome amplification. DNA polymerases cannot distinguish between cytosines and 5-methylcytosines, so that, during the reaction, all 5-methylcytosines are replaced with cytosines. The newly synthesized strands are not remethylated.
In order to be able to distinguish cytosines from 5-methylcytosines, the sample is subjected to a treatment with sodium bisulfite prior to PCR. This process is referred to as bisulfite conversion, which involves conversion of all unmethylated cytosines into uracils. By contrast, the methylated cytosines remain unaltered under the chosen reaction conditions. The reaction of bisulfite conversion is described in NEB, N.E.B. Bisulfite Conversion (available under: http://www.neb-online.de/wp-content/uploads/2015/04/NEB_epigenetik_bisulfit3.jpg), and in Clark et al. (Clark et al. [1994] Nucl. Acids Res 22: 2990-2997).
The bisulfite conversion of cfDNA can, e.g., be carried out using the “EZ DNA Methylation-Gold™ Kit” (“Zymo Research”, USA). For this, 10 ng of the previously extracted cfDNA were dissolved in 20 µL of water, admixed with 130 µL of “CT” conversion reagent and processed in the thermal cycler under the following program: 10 min at 98° C., 2.5 h at 64° C., up to 20 h at 4° C. In the next step, the bisulfite-converted samples were desulfonated and purified. For this purpose, they were admixed with 600 µL of “M-Binding Buffer”, pipetted onto the “Zymo-Spin™ IC” columns and centrifuged at 10 000 g for 30 s. Then, 100 µL of “M-Wash Buffer” were added to the columns. The columns were centrifuged at 10 000 g for 30 s and treated with 200 µL of “M-Desulphonation Buffer” for 20 min. After subsequent centrifugation at 10 000 g for 30 s, the “Zymo-Spin™ IC” columns were washed with 200 µL of “M-Wash Buffer” and centrifuged at 10 000 g for 30 s to remove remaining liquids, and the DNA was eluted at 10 000 g for 30 s with 15 µL of “Elution Buffer”.
In order to be able to analyze the cfDNA methylation profile at the genome-wide level, the previously bisulfite-converted samples were subjected to WGBS. WGBS is an NGS-based method (next-generation sequencing). Nowadays, there are numerous technologies which make NGS possible. The NGS technology which is the most common and is also used here is offered by “Illumina” (USA). The underlying sequencing reaction is fluorescence-based and is done on a glass support, also called flowcell. To immobilize the DNA fragments on the flowcell, specific “Illumina” adapters (short oligonucleotides) are first ligated. The sample is then subjected to a denaturation reaction. Since not only the adapter binding sites but also primers are present on the flowcell, the ssDNA fragment to be sequenced “folds over”. During the subsequent PCR reaction, the DNA strands are amplified. This process is referred to as bridge amplification. It yields, through the progressive amplification at delimited positions, so-called sequencing clusters, which subsequently dissociate. Cluster formation is followed by the actual sequencing reaction, during which there is incorporation of DNA bases which generate fluorescent signals of different wavelengths depending on the base incorporated. After every completed incorporation cycle, said fluorescent signals are detected and thus provide the information about the base sequence within a read.
Different “Illumina” platforms can be used depending on the desired throughput. For the sequencing of specific regions, so-called panels, such as the panel or set of methylation markers identified according to the invention, the relatively rapid and relatively cost-effective “MiSeq” platform is generally sufficient. However, sequencing can, e.g., also be carried out on the “NextSeq 500” or “HiSeq” sequencing platforms or other suitable sequencing platforms.
During bisulfite conversion, DNA is highly stressed by the reagents used and thus degraded to a high degree. This is why conventional WGBS protocols use very high amounts of DNA, at least 500 ng. Since cell-free, circulating DNA is, on the one hand, already very highly fragmented from the beginning and can, on the other hand, only be obtained in a very low amount, the production of WGBS libraries using conventional kits is difficult at present.
Therefore, the “Accel-NGS® Methyl-Seq DNA Library Kit” (“Swift Biosciences”, USA) was established for the following experiments. The kit was specifically developed for WGBS of cfDNA. Even with cfDNA amounts of less than 10 ng, complex WGBS libraries can be generated. The central role is played by the enzyme “Adaptase”, which adds a 10 nt long overhang at the 3′ end of the bisulfite-converted ssDNA. Said overhang allows better ligation of the sequencing adapters and thus more efficient library production. Therefore, according to the invention, a method for the preparation of the WGBS libraries is preferably used which adds a 10 nt long overhang at the 3′ end of the bisulfite-converted ssDNA by means of the enzyme “Adaptase”.
Library production was carried out in four steps using the “Accel-NGS® Methyl-Seq DNA Library Kit” (“Swift Biosciences”, USA): treatment with the enzyme “Adaptase”, extension, ligation, PCR. For the treatment with the enzyme “Adaptase”, 10 ng of bisulfite-converted cfDNA were taken up in 15 µL of water and denatured at 95° C. for 2 min. Then, 25 µL of the “Adaptase Reaction Mix” were added to the sample, carefully mixed and processed in the thermal cycler (program 1: 37° C. for 15 min; 95° C. for 2 min; 4° C.; for all programs, the lid of the thermal cycler was preheated). Next, extension was carried out. For this purpose, the sample was admixed with 44 µL of “Extension Reaction Mix”, carefully mixed and incubated in the thermal cycler (program 2: 98° C. for 1 min; 62° C. for 2 min; 65° C. for 5 min; 4° C.).
The product was purified. For this, e.g., “SPRI Beads” (“Beckman Coulter”, USA) can be used. This was followed by ligation, for which 15 µL of the product were admixed with 15 µL of “Ligation I Reaction Mix” and processed in the thermal cycler (program 3: 25° C. for 1 min; 4° C.). Also in this step, the finished product was purified using “SPRI Beads” (“Beckman Coulter”, USA). Lastly, PCR was carried out. For this, 5 µL of the respective index and 25 µL of the “Indexing PCR Reaction Mix” were added per sample. The finished PCR reaction was incubated in the thermal cycler (program 4: 98° C. for 30 s; PCR cycles: 98° C. for 10 s; 60° C. for 30 s; 68° C. for 1 min (7-9 cycles); 4° C.) and purified by means of the “SPRI Beads” (“Beckman Coulter”, USA) according to the manufacturer’s instructions.
The finished WGBS libraries were quantified and tested for their quality as described in section 1.1.2.2.
The samples were transferred into 1.5 mL Eppendorf reaction tubes and admixed with “SPRI Beads” (“Beckman Coulter”, USA) in the prescribed ratio (Tab. A). Then, the samples were mixed and incubated at room temperature for 5 min. Since the beads were magnetic, the principle of magnetic separation could be used for pelleting. For this purpose, the reaction tubes were placed on a magnetic stand and then incubated at room temperature for 2 min. After incubation, the supernatant was removed, and the beads were washed twice with 500 µL each of 80% ethanol (“Merck Millipore”, USA) and subsequently air-dried. Once the ethanol had evaporated, the samples were removed from the magnetic stand. The “SPRI Beads” were resuspended in the prescribed amount of “Low EDTA TE” buffer (Tab. A) and incubated at room temperature for 2 min. Lastly, the samples were placed on the magnetic stand again. After ca. 2 min, complete separation of the supernatant and the “SPRI Beads” took place. The supernatant, which contained the purified product, was pipetted off and used for the next step.
The sequencing of the WGBS libraries was done on the “NextSeq 500” platform (“Illumina”, USA) in the “TATAA-Biocenter” (Gothenburg, Sweden). This involved carrying out four paired-end (PE) runs with a read length of 76 nt in high-throughput mode.
The WGBS libraries could not be prepared using conventional protocols due to the high fragmentation and low amounts of cfDNA. The cfDNA libraries produced using the “Accel-NGS® Methyl-Seq DNA Library Kit” (“Swift Biosciences”, USA) therefore exhibited a different complexity and fragment distribution compared to conventional WGBS libraries. Therefore, a suitable bioinformatic evaluation pipeline also had to be established to be able to optimally analyze the data.
In general, multiple steps have to be established to be able to evaluate WGBS data (
If the quality control provides satisfactory results, trimming of the adapter sequences takes place. For the cfDNA libraries, the 10 nt long overhang generated by the “Adaptase” also had to be eliminated from the raw data (see section 1.1.2.5.2).
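The removal of a fixed-length overhang from a read can be sketched as follows. This is only an illustration of the principle and not the “Cutadapt” procedure used herein; the function name and parameters are hypothetical, and which end of a read carries the overhang depends on the library layout and the read direction, so the end is passed as a parameter.

```python
from typing import Tuple


def trim_read(seq: str, qual: str, n: int = 10, end: str = "3prime") -> Tuple[str, str]:
    """Remove n bases, and the matching quality values, from one end of a read.

    seq and qual are the sequence and quality lines of one FastQ record.
    The default of 10 corresponds to the 10 nt overhang mentioned in the text.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    if n < 0:
        raise ValueError("n must not be negative")
    if end == "3prime":
        keep = slice(0, max(len(seq) - n, 0))   # drop the last n positions
    elif end == "5prime":
        keep = slice(n, None)                   # drop the first n positions
    else:
        raise ValueError("end must be '3prime' or '5prime'")
    return seq[keep], qual[keep]


# Illustrative read: 12 informative bases followed by a 10 nt tail.
trimmed_seq, trimmed_qual = trim_read("ACGTACGTACGT" + "A" * 10, "I" * 22)
```

Sequence and quality strings are trimmed together so that the record remains a valid FastQ entry.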
After trimming, the reads can be arranged against a reference genome of choice; this process is also referred to as alignment (see section 1.1.2.5.3). For alignment, there are many algorithms available. Depending on the nature of the WGBS Library, the appropriate one must be selected and optimized. For this purpose, mapping efficiency can be analyzed. This involves calculating what percentage of analyzed reads can be assigned to the reference genome. For conventional WGBS libraries, the “Bismark” algorithm is most commonly used (Krueger & Andrews [2011] Bioinformatics 27: 1571-1572). However, in the case of the cfDNA libraries described herein, “Bismark” (version 0.15.0, “Babraham Institute”, England) did not provide satisfactory results (mapping efficiency of approx. 70%). Therefore, further algorithms were tested.
The best results with a mapping efficiency of at least 98% were provided by the “Segemehl” algorithm (version 0.2.0, “Interdisciplinary Centre for Bioinformatics, Leipzig University”, Germany) (Otto et al. [2012] Bioinformatics 28: 1698-1704).
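Mapping efficiency as defined above, i.e., the percentage of analyzed reads that can be assigned to the reference genome, can be computed from an alignment in “SAM” text format as in the following sketch. It relies on the standard “SAM” FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary); the function name and the example records are hypothetical, and the sketch is not the “BAT_mapping_stat” module.

```python
from typing import Iterable


def mapping_efficiency(sam_lines: Iterable[str]) -> float:
    """Return the percentage of primary alignment records that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue                      # skip blank lines and header lines
        flag = int(line.split("\t")[1])   # second column of a SAM record is FLAG
        if flag & 0x100 or flag & 0x800:
            continue                      # count each read once: skip secondary/supplementary
        total += 1
        if not flag & 0x4:                # bit 0x4 set means "segment unmapped"
            mapped += 1
    if total == 0:
        raise ValueError("no alignment records found")
    return 100.0 * mapped / total


# Shortened, illustrative records (only the first columns are shown).
example_sam = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100",
    "read2\t16\tchr1\t250",
    "read3\t4\t*\t0",
    "read4\t256\tchr2\t500",
]
efficiency = mapping_efficiency(example_sam)
```

In this example, two of three primary records are mapped; the values of approx. 70% and at least 98% reported above would be obtained in the same way from the respective alignment files.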
After alignment, the data are filtered according to CpG context and the desired coverage (at least fourfold), e.g., with the “Bisulfite Analysis Toolkit” (version 0.1, “Interdisciplinary Centre for Bioinformatics, Leipzig University”, Germany), and are only then used for peak calling (see section 1.1.2.5.3). Coverage, also called sequencing depth, specifies how frequently a position was read during sequencing. For example, an average coverage of 100-fold states that each sequenced base was read on average 100 times. Peak calling is the actual step in which the methylation status of the particular CpG is calculated. This involves looking at all reads which contain a certain CpG, calculating the proportion of reads with a retained cytosine among all reads showing either cytosine or converted cytosine (uracil, read as thymine after PCR), and outputting the result as a number between 0 and 1, wherein 0 corresponds to a methylation of 0% and 1 to a methylation of 100%.
Conventional libraries have an average coverage of 30 to 40-fold, which is also what the conventional methods for peak calling are designed for. The cfDNA libraries had an average coverage of 8 to 10-fold due to their lower complexity. Accordingly, filtering and peak calling had to be optimized, e.g., with the “Bisulfite Analysis Toolkit”.
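The calculation of a methylation rate for a single CpG can be sketched as follows. The sketch assumes that the bases observed at the cytosine position have already been collected from all covering reads and oriented to the strand of that cytosine; after bisulfite conversion and PCR, an unmethylated cytosine is read as thymine. The function name is hypothetical and the code is not the “BAT_calling” module.

```python
from typing import Optional


def methylation_rate(bases: str, min_coverage: int = 4) -> Optional[float]:
    """Return the methylation rate (0..1) of one CpG, or None if coverage is too low.

    bases: the bases observed at the cytosine position, one per covering read.
    Only C (retained, i.e., methylated) and T (converted, i.e., unmethylated)
    are informative; other bases are ignored.
    """
    methylated = bases.count("C")
    unmethylated = bases.count("T")
    coverage = methylated + unmethylated
    if coverage < min_coverage:
        return None                       # position filtered out (below minimum coverage)
    return methylated / coverage


rate = methylation_rate("CCCT")           # 3 of 4 informative reads are methylated
low_coverage = methylation_rate("CCT")    # only threefold coverage
```

A rate of 0 thus corresponds to 0% methylation and a rate of 1 to 100% methylation; the default minimum coverage of fourfold is taken from the text above.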
Once the DNA methylation rates are established, further specific analyses can be done in a programming language of choice depending on the question asked. For the analyses described herein, “R” (version 3.2.0, “R Foundation for Statistical Computing”, Austria), “Perl” (version 5.26.0, “The Perl Foundation”, USA) and “Python” (version 3.3.6, “Python Software Foundation”, USA) were used (see section 1.1.2.5.3).
Since the analyses described herein required very high computing capacity, they were done on an “NEC HPC Linux Cluster”. The front-end processor was accessed via an SSH connection using “MobaXterm Personal Edition” software (“Mobatek”, France).
The raw data were provided in “FastQ” format. This is a text-based format which is used for storing of the reads as well as associated quality parameters. To check the quality of the sequencing, “FastQC” software was used.
The raw data were processed using “Cutadapt” software (version 1.9.1, “TU Dortmund”, Germany) (Martin EMBnet.journal 17). This involved carrying out two steps.
Subsequent data analysis was carried out using the “Bisulfite Analysis Toolkit” [201]. The function of this modularly constructed Software is depicted in
Alignment was carried out against the “HG19” reference genome. Several algorithms were tested, but surprisingly the “Segemehl” algorithm provided the best results (cf. section 1.1.2.5). The algorithm is based on searching for an optimal hit in the reference genome (Hoffmann et al. [2009] PLoS Comput. Biol. 5: e1000502). The maximum permitted number of inaccuracies per read (e.g., insertions, deletions, point mutations) was 10%. All hits which fell short of this threshold value were admitted to semiglobal alignment. Ultimately, only the reads with an accuracy of at least 90% were listed in a final file and used for further analyses.
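The accuracy criterion described above can be expressed as a simple check. The sketch assumes that the inaccuracies of a read are available as an edit distance (sum of insertions, deletions and mismatches) and that the 10% limit is inclusive; the function name is hypothetical, and the sketch does not reproduce the internal seed search of “Segemehl”.

```python
def passes_accuracy(read_length: int, edit_distance: int,
                    max_error_percent: int = 10) -> bool:
    """Return True if at most max_error_percent of the read positions are inaccurate."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if edit_distance < 0:
        raise ValueError("edit_distance must not be negative")
    # Integer arithmetic avoids floating-point rounding at the threshold.
    return edit_distance * 100 <= max_error_percent * read_length


# For a 76 nt read, 7 inaccuracies (about 9.2%) pass and 8 (about 10.5%) do not.
kept = passes_accuracy(76, 7)
rejected = passes_accuracy(76, 8)
```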
The “BAM” format preferably used in this context is a compressed version of the “SAM” file, a text-based format which is generated by the algorithm for storing the alignment results. Mapping efficiency was statistically evaluated using, e.g., the “BAT_mapping_stat” module (Kretzmer et al. [2017] F1000Res. 6: 1490).
Lastly, all reads which belonged to a sample were merged into a “BAM” file using the “BAT_merging” module. Overlapping sequences were eliminated using the “ClipOverlap” (BamUtil version 1.0.13) module. The commands were:
In the next step, DNA methylation was detected with the aid of “BAT_calling”. The module generates a “VCF” file. This is a text file which only contains information about the detected DNA methylation rates, coverage, number of covered nucleotides and the sequence context. In the further course of the analyses, this file was filtered for CpG context and a coverage of at least eightfold. In this context, figures as well as further “VCF” and “BedGraph” files were generated. Next, the “BAT_summarize” module was used, which ascertained the mean values of the detected DNA methylation rates of two groups. The calculated DNA methylation rates and the genomic coordinates of the cytosines were written into a text-based “BedGraph” file, which was used later on for the identification of differentially methylated regions.
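The filtering and averaging logic can be sketched as follows. This is not the implementation of the “Bisulfite Analysis Toolkit”; the record layout (dictionaries with chromosome, position, context, coverage and rate) is a hypothetical simplification, and the restriction to positions that pass the filter in every sample of a group is an assumption made for the example.

```python
from statistics import mean

MIN_COVERAGE = 8  # coverage threshold stated in the text


def filter_calls(calls, min_coverage=MIN_COVERAGE):
    """Keep methylation calls in CpG context with sufficient coverage."""
    return [c for c in calls
            if c["context"] == "CG" and c["coverage"] >= min_coverage]


def group_mean_rates(samples):
    """Mean methylation rate per position across the samples of one group.

    Assumption: only positions passing the filter in every sample are kept.
    """
    per_sample = [
        {(c["chrom"], c["pos"]): c["rate"] for c in filter_calls(calls)}
        for calls in samples
    ]
    shared = set.intersection(*(set(d) for d in per_sample))
    return {key: mean(d[key] for d in per_sample) for key in sorted(shared)}


# Hypothetical calls for two samples of the same group.
sample_a = [
    {"chrom": "chr1", "pos": 100, "context": "CG", "coverage": 12, "rate": 0.75},
    {"chrom": "chr1", "pos": 150, "context": "CHH", "coverage": 20, "rate": 0.10},
    {"chrom": "chr1", "pos": 200, "context": "CG", "coverage": 5, "rate": 0.50},
]
sample_b = [
    {"chrom": "chr1", "pos": 100, "context": "CG", "coverage": 9, "rate": 0.25},
    {"chrom": "chr1", "pos": 200, "context": "CG", "coverage": 10, "rate": 0.40},
]
group_rates = group_mean_rates([sample_a, sample_b])
print(group_rates)
```

In the example, position 150 is dropped because it is not in CpG context, and position 200 is dropped because its coverage in the first sample is below eightfold.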
The visualization of DNA methylation per group was carried out using the “BAT_overview” module [201]. The commands were:
In the context of this work, data from two methods for genome-wide examination of DNA methylation patterns were used: WGBS and methylation array (HM 450K).
“Bedtools” software was used for the correlation analyses. The “Bedtools Intersect” module reads both the WGBS results and the HM 450K results, checks them for overlap and writes the overlapping CpG loci into a new “BED” file. The “BED” format is a text file. Each line of the file contains genomic coordinates of a CpG. The columns are separated by a tab character. The “BED” file was subsequently directly loaded into “R” and subjected to “Pearson” correlation analysis (p-value < 0.01). The results were likewise visualized in R.
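The two steps, intersecting the loci and correlating the paired methylation rates, can be sketched in a few lines. The sketch below does not call “Bedtools” or “R”; it assumes that both data sets have already been read into dictionaries keyed by chromosome and position, and all names and values are hypothetical. The significance test is omitted.

```python
import math


def overlap(wgbs, array):
    """Pair methylation rates for CpG positions present in both data sets."""
    shared = sorted(set(wgbs) & set(array))
    return [(wgbs[key], array[key]) for key in shared]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two paired values")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical methylation rates keyed by (chromosome, position).
wgbs = {("chr1", 100): 0.1, ("chr1", 200): 0.5,
        ("chr2", 50): 0.9, ("chr3", 10): 0.3}
hm450k = {("chr1", 100): 0.2, ("chr1", 200): 0.6,
          ("chr2", 50): 1.0, ("chr4", 5): 0.4}

pairs = overlap(wgbs, hm450k)
r = pearson(pairs)
print(len(pairs), round(r, 3))
```

Only loci covered by both methods enter the correlation; loci present in a single data set are ignored.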
The WGBS data were evaluated as described. The “BedGraph” file generated using the “BAT_summarize” module contained three groups (control, adenocarcinoma, squamous cell carcinoma), each with 11 289 424 positions. The “BedGraph” file was divided into two lists. The first list contained 29 877 loci which showed differences in DNA methylation between the tumor and control groups. The second list contained 76 374 CpG loci differentially methylated between the adenocarcinoma and squamous cell carcinoma groups. Regions were regarded as differentially methylated if they showed a difference in DNA methylation of at least 15%.
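The 15% criterion can be illustrated with a minimal sketch. It assumes that the difference is an absolute difference between the mean methylation rates of two groups, expressed as fractions between 0 and 1; the data and names are hypothetical.

```python
MIN_DIFFERENCE = 0.15  # threshold stated in the text


def differentially_methylated(group_a, group_b, min_difference=MIN_DIFFERENCE):
    """Return loci whose mean methylation differs by at least min_difference.

    The returned value is group_a minus group_b: positive values mean
    hypermethylation in group_a, negative values hypomethylation.
    """
    result = {}
    for key in sorted(set(group_a) & set(group_b)):
        diff = group_a[key] - group_b[key]
        if abs(diff) >= min_difference:
            result[key] = diff
    return result


# Hypothetical group means keyed by (chromosome, position).
tumor = {("chr1", 100): 0.80, ("chr1", 200): 0.30, ("chr2", 50): 0.50}
control = {("chr1", 100): 0.50, ("chr1", 200): 0.25, ("chr2", 50): 0.75}

dm_loci = differentially_methylated(tumor, control)
print(sorted(dm_loci))
```

The sign of the difference directly gives the direction of the change in the tumor group.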
Next, the two lists were sorted according to chromosomes and annotated with the “HG19” reference genome. CpG loci which were located on chromosomes X, Y or M (mitochondrial chromosome), lay within common SNPs (≥1% of the population) or were not protein-coding were discarded.
The remaining CpG loci had to meet one of the three criteria in order to be incorporated into the plasma panel:
The DNA regions which met one of these three criteria were incorporated into the plasma panel (see Tab. 1). All calls used are described in detail below.
In addition to diagnostically or therapeutically relevant information (e.g., stage and tumor entity), the panel should also contain prognostic information. Therefore, it was extended by 33 CpG loci, which were collected in the context of a clinical study. The title of the study was: “Comprehensive characterization of non-small cell lung cancer (NSCLC) by integrated clinical and molecular analysis”.
The HM 450K data set made available contained information about the DNA methylation status of a total of 41 lung carcinomas. The patients were classified according to survival time. In this context, 28 patients were included in the prognostically favorable group (survival longer than 15 months) and 13 in the unfavorable group (survival shorter than 13 months). The 33 CpG loci incorporated into the panel were able to separate both groups from one another on the basis of the DNA methylation pattern and thus contained information relevant for prognosis.
In addition to the WGBS and HM 450K results, 26 differentially methylated regions from the study on bivalent chromatin in tumors were incorporated into the plasma panel.
Bivalent promoters carry both activating and repressing histone modifications, which play an important role especially during cell differentiation processes. They are commonly incorrectly regulated in tumor cells. During the study, WGBS and HM 450K data sets of various tumor samples and cell lines (n=7000) were analyzed.
The set of methylation markers according to the invention, the plasma panel, contained 630 differentially methylated regions (Tab. 1). It was synthesized by the company “Roche” (Switzerland) and shipped on dry ice. This was a custom-synthesized, non-commercially available “SeqCap Epi Enrichment Kit” (“Roche”, Switzerland). According to the manufacturer, the panel was suitable for the analysis of both tissue samples and circulating, cell-free DNA.
It was validated in the context of a pilot study. For this purpose, blood plasma from 12 patients was provided by the DZL. Of these, three patients were healthy or tumor-free at the time of examination (control group) and nine were suffering from non-small cell lung carcinomas of different stages (tumor group).
Validation was carried out in multiple steps. First, the validation material, the circulating, cell-free DNA, was prepared. Extraction from plasma, quantification, quality control (QC) and bisulfite conversion were carried out as already described in sections 1.1.2.1-1.1.2.3.
Then, 10 ng of converted cfDNA from each sample was used for library preparation. Library preparation was done in two steps. In the first step, as described in section 1.1.2.4, a WGBS library was prepared from each sample, which contained information about the entire cfDNA methylome of the corresponding patient. However, since only the 630 differentially methylated regions were to be sequenced and analyzed in the further course, they were extracted from the entire methylome and enriched in the second step. This was done using the “SeqCap Epi Enrichment Kit”, of which the plasma panel synthesized by “Roche” was a component (see section 1.2.1).
The finished library was subjected to a QC and was quantified (see section 1.1.2.2) and subsequently sequenced on the “MiSeq” (“Illumina”, USA) (see section 1.2.2). The sequencing data were stored in “FastQ” format and had to be subsequently analyzed (see section 1.2.3). For this purpose, the bioinformatic pipeline from section 1.1.2.5 was adapted, since this time only the 630 specific regions of the plasma panel were to be analyzed rather than the entire methylome.
The results were lastly used to develop a classifier, which subsequently interpreted the DNA methylation patterns and provided diagnostically as well as clinically relevant information about the health status of a patient (see section 1.2.3.3).
The same principle can be used to analyze samples from a patient who is to be diagnosed with lung tumors. Here, the samples are, however, not pooled for analysis.
The “SeqCap Epi Enrichment Kit” was used to extract and enrich 630 differentially methylated regions from the whole cfDNA methylome. One of the components of the kit was the designed plasma panel (see Tab. 1). The oligonucleotides contained therein, also called “Capture Probes”, hybridized to the differentially methylated regions and could be enriched and amplified in the further course (
The 12 WGBS libraries produced were pooled equimolarly within the different groups and were first prepared for a hybridization reaction. In the case of diagnostic samples, either individual samples are hybridized or pools of samples, each provided with a “Barcode”, are used. For this purpose, 1 µg of the WGBS library pool together with 10 µL of “Bisulfite Capture Enhancer”, 1 µL of “SeqCap HE Universal Oligo” and 1 µL of “SeqCap HE Index Oligo” was pipetted into a 1.5 mL reaction vessel having a small hole in the lid. The sample was evaporated in a vacuum concentrator until a clear white pellet could be seen. The “SeqCap HE Universal” and “SeqCap HE Index” oligonucleotides were added in excess (1 µL corresponded to 1000 pmol) and served to bind the exposed WGBS universal and index adapters. This was intended to prevent the WGBS adapters from interfering with the subsequent hybridization reaction.
For the actual hybridization reaction, 7.5 µL of 2× “Hybridisation Buffer” and 3 µL of “Hybridisation Component A” were directly added to the pellet, mixed for 10 s, briefly centrifuged and incubated at 95° C. for 10 min. Then, the sample was transferred into a 0.2 mL reaction vessel, admixed with 4.5 µL of “Capture Probes”, mixed well and incubated in a thermal cycler at 47° C. for 72 h. The lid of the thermal cycler was preheated to 57° C. The “Capture Probes” were specifically synthesized for this project. They contained 630 different oligonucleotides which were complementary to the examined differentially methylated regions (see Tab. 1) and specifically bound them in the course of the hybridization reaction.
In the next step, the bound “Capture Probes” were enriched and washed multiple times. For this purpose, multiple wash buffers as well as the “Capture Beads” were prepared according to the manufacturer’s instructions.
The hybridized sample was admixed with 100 µL of “Capture Beads”, briefly mixed and incubated in the thermal cycler at 47° C. for 45 min. The lid of the thermal cycler was preheated to 57° C. To prevent the beads from settling, the samples were briefly removed from the thermal cycler every 15 min and mixed. The “Capture Beads” used herein were streptavidin beads, which interacted with the biotinylated “Capture Probes”.
After incubation, the samples were removed from the thermal cycler and the “Capture Beads” were subjected to multiple wash steps. Separation of the beads from the buffer was performed each time at room temperature using the “DynaMag™-PCR” magnet (“Thermo Fisher Scientific”, USA).
In the first part of the wash protocol, only buffers previously preheated to 47° C. were used. The sample was admixed with 100 µL of 1× “Wash Buffer I”, briefly mixed, and pelleted with the aid of a magnet. The supernatant was discarded and the beads were resuspended in 200 µL of 1× “Stringent Wash Buffer”, incubated in a thermal cycler at 47° C. for 5 min, and again pelleted with the aid of a magnet. The supernatant was again discarded and the beads were washed two further times with 200 µL of 1× “Stringent Wash Buffer”.
The second part of the wash protocol took place entirely at room temperature; accordingly, the buffers used for this had to be brought to room temperature. First, the “Capture Beads” previously washed at 47° C. were resuspended in 200 µL of 1× “Wash Buffer I”, mixed for 2 min, and pelleted with the aid of a magnet. The supernatant was discarded, the beads were admixed with 200 µL of 1× “Wash Buffer II”, mixed for 1 min, and again pelleted with the aid of a magnet. Here too, the supernatant was discarded, the beads were resuspended in 200 µL of “Wash Buffer III”, briefly mixed, and lastly separated from the supernatant on the magnet.
For the subsequent elution, 50 µL of dH2O were directly added to the beads, the beads were incubated at room temperature for 2 min and pelleted with the aid of a magnet. The supernatant was carefully pipetted from the reaction vessel and was used for all further steps.
After washing, the enriched differentially methylated regions were amplified. For this purpose, 25 µL of 2× “KAPA HiFi HotStart Ready Mix” (“Roche”, Switzerland) and 5 µL of “Post LM PCR Oligonucleotides” (“Roche”, Switzerland) were added, e.g., to 20 µL of eluate, mixed well and amplified in the thermal cycler with preheated lid using the following PCR program:
The amplified regions were subsequently purified, e.g., using the “AmpureXP” beads (“Beckman Coulter”, USA). For this purpose, the beads were first brought to room temperature. The sample was transferred into a 1.5 mL reaction vessel. 50 µL of dH2O and 180 µL of “AmpureXP” beads were added to 50 µL of sample. The sample was briefly mixed, incubated at room temperature for 15 min, briefly centrifuged, and placed on the “DynaMag™-2” magnet (“Thermo Fisher Scientific”, USA). The supernatant was discarded and the beads were washed twice, each time with 200 µL of freshly prepared 80% ethanol. Then, the beads were dried at room temperature for 15 min. To elute the libraries, 52 µL of dH2O were pipetted onto the dry beads. The beads were mixed well, incubated at room temperature for 2 min, and again placed on the “DynaMag™-2”. The supernatant was carefully pipetted off and was used for quantification, QC (see section 1.1.2.2) and sequencing on the “MiSeq”.
Sequencing of the NGS library of enriched, differentially methylated regions was carried out on the “MiSeq”.
For this purpose, the library produced was first diluted to 4 nM and denatured. Then, 5 µL of the 4 nM library were transferred into a 1.5 mL reaction vessel, admixed with 5 µL of 0.2 M NaOH, briefly mixed, centrifuged at 280 g for 1 min, and incubated at room temperature for 5 min. The denatured library was then admixed with 990 µL of “Buffer HT1” (“Illumina”, USA) and again mixed well. This yielded a 20 pM library which was subsequently diluted to 4 pM using “Buffer HT1” and admixed with 10% “PhiX” (“Illumina”, USA).
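The dilution series can be checked with the relation C1 × V1 = C2 × V2. The following sketch reproduces the stated concentrations; the final volume of 600 µL used for the last dilution step is a hypothetical value chosen for the example and is not taken from the description.

```python
def diluted_concentration(stock_conc, stock_volume, final_volume):
    """Concentration after diluting stock_volume to final_volume (C1*V1 = C2*V2)."""
    return stock_conc * stock_volume / final_volume


def stock_volume_for(target_conc, stock_conc, final_volume):
    """Volume of stock needed to reach target_conc in final_volume."""
    return target_conc * final_volume / stock_conc


# 5 µL of 4 nM library plus 5 µL of NaOH gives 10 µL.
after_naoh_nM = diluted_concentration(4.0, 5.0, 10.0)

# Adding 990 µL of buffer gives 1000 µL; 1 nM equals 1000 pM.
after_ht1_pM = diluted_concentration(after_naoh_nM * 1000.0, 10.0, 1000.0)

# Hypothetical final volume of 600 µL at the target concentration of 4 pM.
final_volume_uL = 600.0
library_volume_uL = stock_volume_for(4.0, after_ht1_pM, final_volume_uL)
buffer_volume_uL = final_volume_uL - library_volume_uL

print(after_naoh_nM, after_ht1_pM, library_volume_uL, buffer_volume_uL)
```

The intermediate values confirm that 5 µL of a 4 nM library, denatured with an equal volume of NaOH and made up to 1000 µL, yields a 20 pM library.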
Lastly, a “MiSeq 150 V3” cassette (“Illumina”, USA) was loaded with the finished sample and sequenced in a 76 PE run.
As described in sections 1.1.2.5.1 and 1.1.2.5.2, the data were subjected to a “FastQC” analysis and subsequently processed.
As described in section 1.1.2.5.3, the processed data were aligned against the “HG19” reference genome using the “Segemehl” algorithm. PCR duplicates were removed using “Samtools” (version 1.3.1, “Wellcome Trust Sanger Institute”, England, “Broad Institute of MIT and Harvard”, USA). The command was:
The DNA methylation rates within the sequenced regions were calculated using the “BAT_calling” module and filtered using the “BAT_filter_vcf” module according to the CpG context and a coverage of at least eightfold (see section 1.1.2.5.3). Lastly, the data were annotated against the regions of the plasma panel. The calls were:
The plasma panel was then used to analyze the DNA methylation pattern of a patient. From this, it was to be concluded whether a patient has a malignant lung tumor. If this is the case, information about the entity of the tumor and the prognosis of the patient affected was to be derived from the DNA methylation profile. This can be done on the basis of the correlation between the methylation patterns which are present in the patient and the methylation markers which are important according to the invention.
For this purpose, a classifier can be created which is capable of rapidly and reliably interpreting the results of the pipeline described in sections 1.2.3.1 and 1.2.3.2. A classifier, also called predictive modeling, is an example of supervised learning. It is the goal of a classifier, after receiving variables (e.g., DNA methylation patterns) and an annotation, to first create a model which is later capable of correctly classifying the variables of independent samples (
For example, the “Qlucore Omics Explorer” software offers several options for creating, from DNA methylation data, an optimal classifier for the particular question. For this, a selection from three algorithms can be made: “k-Nearest Neighbors Algorithm” (kNN), “Support Vector Machines” (SVM) and “Random Trees” (RT). For kNN, a class assignment is made based on the consideration of the k nearest neighbors. SVM describes each object by a vector in a vector space. Within the vector space, a hyperplane is placed such that it acts as a separation plane between the groups and divides them into two classes. RT consists of multiple uncorrelated decision trees which are generated during the learning process. Each tree makes a decision, and the class with the most votes determines the final classification.
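The kNN principle can be illustrated with a minimal sketch. It is not the implementation used in “Qlucore Omics Explorer”; the Euclidean distance, the value k = 3 and the methylation rates at three marker positions are hypothetical choices made for the example.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Classify query by majority vote among its k nearest training samples.

    training: list of (features, label) pairs, features being tuples of
    methylation rates at the same marker positions as query.
    """
    if not 1 <= k <= len(training):
        raise ValueError("k must be between 1 and the number of samples")
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical methylation rates at three marker positions.
training = [
    ((0.10, 0.15, 0.20), "control"),
    ((0.12, 0.10, 0.25), "control"),
    ((0.20, 0.18, 0.15), "control"),
    ((0.80, 0.75, 0.90), "tumor"),
    ((0.85, 0.70, 0.80), "tumor"),
    ((0.78, 0.82, 0.88), "tumor"),
]

label_1 = knn_predict(training, (0.82, 0.74, 0.86))
label_2 = knn_predict(training, (0.15, 0.12, 0.20))
print(label_1, label_2)
```

With an odd k and two classes, the vote cannot end in a tie, which is one reason odd values of k are commonly chosen.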
In general, it is difficult to predict in advance which algorithm will provide the optimal results for a new problem. Therefore, all three available algorithms were tested to find the best one for the particular category.
40 surgical preparations and corresponding controls were examined for their genome-wide DNA methylation using the “Illumina Infinium HumanMethylation450K BeadChip”.
In comparison with healthy lung tissue, 898 aberrantly methylated CpG loci were identified in malignant tumor tissue (q < 1 × 10⁻²³, σ/σmax > 0.4;
Next, those CpG loci were selected which allowed reliable classification of lung tumors on the basis of malignancy and entity. For this purpose, the bioinformatic analyses described in section 1.1.1 were carried out, which yielded 287 CpG loci. Said loci were incorporated into a set of methylation markers preferred according to the invention, the plasma panel (Tab. 1).
As described in section 1.1.2.2, each individual cell-free, circulating DNA sample was quantified and subjected to a strict quality control after extraction. The total amount of extracted DNA was 10 to 30 ng per sample, of which 1 ng was analyzed using the “Agilent 2100 Bioanalyzer”. The samples showed a clear peak at ca. 167 bp. The peaks at 35 bp and 10 380 bp corresponded to the lower and upper markers, respectively (not shown).
After bisulfite conversion, the cfDNA samples were used to produce WGBS libraries. The completed libraries were, in turn, quantified and subsequently subjected to a quality control using the “Agilent 2100 Bioanalyzer”. All samples showed a clear peak at ca. 300 bp and therefore met the requirements for sequencing.
The WGBS libraries produced were sent on dry ice to the “TATAA Biocenter”, where they were pooled and, depending on the sample, sequenced with an average coverage of 8- to 10-fold on a “NextSeq 500” platform. The raw data were provided in “FastQ” format.
The quality of the raw data was checked using “FastQC” software. Since the samples were sequenced 76 PE, the read length was, as expected, 76 bp. Within a read, the content of adapters and of nonidentifiable signals was 0%. The accuracy of sequencing was specified in “Phred” values. Each “Phred” value describes how accurately nucleotide reads were made during the course of sequencing. The raw data had a “Phred” score of over 30, which corresponded to an accuracy of more than 99.9%. Furthermore, only a very small amount of kmers could be detected. Kmers refer to sequences having a minimum length of two nucleotides that repeat again and again in the raw data. The number of PCR duplicates was virtually 0%. The amount of PCR duplicates is ascertained by calculating the percentage of deduplicated sequences and comparing it with the number of all sequences. A small amount of kmers and PCR duplicates indicates good library and sequencing quality.
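The relation between a “Phred” value Q and the error probability P is P = 10^(-Q/10), so that Q = 30 corresponds to an accuracy of 99.9%. The following sketch shows the conversion. The decoding of quality characters assumes the common Phred+33 encoding of “FastQ” files, which is not stated in the description, and the quality string is a made-up example.

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q):
    """Probability that a base call with Phred score q is correct."""
    return 1.0 - phred_to_error_probability(q)


def decode_quality(quality_string, offset=33):
    """Convert a FastQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


scores = decode_quality("II?5")  # made-up quality string
accuracy_q30 = phred_to_accuracy(30)
print(scores, accuracy_q30)
```

Each increase of 10 in the “Phred” value reduces the error probability tenfold.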
Furthermore, a WGBS-typical base composition was analyzed. During bisulfite conversion, most unmethylated cytosines were replaced by thymines. Therefore, the thymine content of the raw data was ca. 50% and the cytosine content was virtually 0%. The adenine and guanine compositions were not influenced during bisulfite conversion and were 25% each.
Subsequently, the WGBS raw data were processed using “Cutadapt” software (see section 1.1.2.5.2). The processing removed both overrepresented sequences and the 10 nt long overhang at the start of read 2.
The processed sequencing data were then loaded into the “Bisulfite Analysis Toolkit” and aligned against the “HG19” reference genome using the “Segemehl” algorithm implemented there. The efficiency of alignment is specified as mapping efficiency. This indicates what percentage of reads can be assigned to the reference genome. In this case, the mapping efficiency of the “Segemehl” algorithm was 98% to 99% and was therefore suitable for all further analyses.
Next, the alignments of the control, adenocarcinoma and squamous cell carcinoma groups were loaded into the “BAT_calling” module. The module ascertained DNA methylation rates of respective cytosines. The cytosines which lay within a CpG region and had a coverage of at least eightfold were then identified using the “BAT_filtering” module and used for all further analyses.
More than 4 million CpG loci per group met the criteria and were analyzed later on using the “BAT_overview” module. The results clearly showed that the lung carcinoma group and the control group can be distinguished from one another on the basis of the DNA methylation patterns (
To detect the differentially methylated regions specific for the respective group, filtering was carried out according to a difference in DNA methylation of at least 15%. In this context, the number of differentially methylated CpG loci in the plasma of lung carcinoma patients was 18 000 (
To compare the detected DNA methylation patterns in the surgical preparations with those in the blood plasma of the lung carcinoma patients, a “Pearson” correlation analysis was carried out using “R” and “Bedtools” (see section 1.1.2.5.4), which, depending on the sample, yielded a concordance of 71% to 77% (p-value < 2.2 × 10⁻¹⁶,
This shows that results on the basis of surgical preparations or solid biopsies cannot be readily applied to liquid biopsies, so that the present validation with liquid biopsies is crucial for the validity of the diagnostic procedure.
First, as described in section 1.1.2.2, the extracted cfDNA samples were quantified and subjected to a quality control. For this purpose, 1 ng of each sample was examined using the “Agilent 2100 Bioanalyzer”. All cfDNA samples used showed a clear peak at ca. 167 bp. Subsequently, the samples were bisulfite-converted and used to produce NGS libraries. As described in section 1.2.1, production of the libraries was performed in two steps.
In the first step, WGBS libraries which comprised information about the whole cfDNA methylome were produced. All 12 WGBS libraries produced showed a clear, large peak at ca. 300 bp. The larger 300 to 1,000 bp peaks were the so-called daisy chains, i.e., ssDNA fragments hybridized to each other. According to the manufacturer’s instructions, they neither influence the subsequent hybridization reaction nor the actual sequencing and therefore do not have to be eliminated.
In the second step, the WGBS libraries produced were quantified, equimolarly pooled, and processed using the “SeqCap Epi Enrichment Kit”. The kit used herein contained the so-called “Capture Probes” which were specifically synthesized for this purpose. The “Capture Probes” specifically hybridize to the 630 regions of the plasma panel (see Tab. 1). After hybridization, the “Capture Probes” together with the bound differentially methylated regions were enriched, washed and amplified. The amplified library was then quantified and subjected to a quality control (e.g., “Agilent 2100 High Sensitivity DNA Kit”). The finished library had a high peak at ca. 300 bp and therefore met the sequencing requirements of the “MiSeq”.
First, sequencing was optimized on the “MiSeq”. Sequencing was done in a 76 PE mode. Thus, the first 76 bp of the sequenced DNA fragments were read from both ends. To achieve the optimal cluster density, the library was diluted to 4 pM. The libraries described herein were unbalanced. Unbalanced refers to libraries whose AT or GC content is less than 40% or more than 60%. Because of their composition, such libraries usually have an unsatisfactory sequencing quality. To prevent this, the library can be admixed with “PhiX Control V3”. The concentration of “PhiX” must be individually adapted depending on the library. The optimal concentration of “PhiX Control V3” was 10% in the present case.
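The definition of an unbalanced library can be expressed as a simple check on the base composition. Since the AT and GC fractions add up to 1, it suffices to test whether the GC fraction lies outside the range of 40% to 60%. The reads below are made up; the first set mimics the low cytosine content of bisulfite-converted DNA.

```python
def gc_fraction(sequences):
    """Fraction of G and C among all A, C, G and T bases in the reads."""
    bases = "".join(sequences).upper()
    counted = sum(bases.count(b) for b in "ACGT")
    if counted == 0:
        raise ValueError("no A, C, G or T bases found")
    return (bases.count("G") + bases.count("C")) / counted


def is_unbalanced(sequences, low=0.40, high=0.60):
    """True if the GC (and hence AT) fraction lies outside [low, high]."""
    gc = gc_fraction(sequences)
    return gc < low or gc > high


# Made-up reads: bisulfite-like (almost no C) versus balanced composition.
converted_reads = ["TTAGGTTTTGATTTAGTTAT", "ATTTTGGATTAGTTTTAGAT"]
balanced_reads = ["ACGTACGTAC", "GGCCAATTGC"]

print(gc_fraction(converted_reads), is_unbalanced(converted_reads))
print(gc_fraction(balanced_reads), is_unbalanced(balanced_reads))
```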
After sequencing, the data were stored in “FastQ” format. The quality of the raw data was checked using “FastQC” software.
Because of 76 PE sequencing, the read length was 76 bp. The content of adapters and nonidentifiable signals within a read was 0%. The raw data had a “Phred” score of over 30, which corresponded to a sequencing accuracy of more than 99.9%. The base composition (thymine content at ca. 50%, cytosine content at virtually 0%, adenine and guanine content at 25%) indicated successful bisulfite conversion. The first 10 nt of the second read was an overhang generated by the enzyme “Adaptase”. The deviation of the experimentally ascertained GC content from the theoretically calculated one was also because of the bisulfite conversion.
The number of PCR duplicates was ca. 15%. The number of deduplicated sequences deviated greatly from the total amount. However, this is not unusual for a panel. In contrast to genome-wide sequencing, in a panel only a small region of the genome is sequenced. This leads to a very low complexity of the library and, accordingly, to the formation of PCR duplicates. The number of kmers was very low and did not interfere with further evaluation.
In summary, it can be stated that the panel sequencing data had a very good quality. To process the data, two steps were carried out. First, the 10 nt long overhang at the start of read 2 and adapters were removed using “Cutadapt” software. Then, the PCR duplicates were completely eliminated using “Samtools” software.
The processed sequencing data were then loaded into the “Bisulfite Analysis Toolkit”. Alignment was carried out using “Segemehl” against the “HG19” reference genome. The mapping efficiency was at least 90%. This means that at least 90% of the raw data could be assigned to the reference genome. The average coverage, i.e., the sequencing depth, was 10- to 30-fold depending on the sample.
In the next step, DNA methylation was to be detected. For this purpose, the 12 alignments were loaded into the “BAT_calling” module. The positions ascertained were then first annotated against the “HG19” reference genome using “Bedtools”. Then, the methylated positions were filtered according to a coverage of at least eightfold using the “BAT_filtering” module. Furthermore, the module for creating a classifier was used to select only those positions that were, on the one hand, located in a CpG region and, on the other hand, were listed in the plasma panel (Tab. 1).
The ascertained cfDNA methylation rates were used to create a classifier. As described in section 1.2.3.3, “Qlucore Omics Explorer” software was used for this purpose, which contained the following classification algorithms: “k-Nearest Neighbors Algorithm” (kNN), “Support Vector Machines” (SVM) and “Random Trees” (RT).
The plasma panel was designed such that it should be optimally capable of providing information regarding the malignancy, the entity and the stage of a tumor. These questions could be answered reliably by the choice of a suitable classifier. Furthermore, it should also be possible to obtain information relating to prognosis.
To assess a classifier, two parameters were considered: accuracy and complexity. The accuracy of a classifier was specified in values between 0 and 1, wherein 0 corresponded to an accuracy of 0% and 1 to an accuracy of 100%. Complexity indicated how many differentially methylated positions or markers had to be analyzed so that the classifier achieved this accuracy. The fewer markers that needed to be evaluated, the more appropriate the classifier was for the clinic. This is because the error rate, time and costs of the method increase with the number of positions to be analyzed.
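The two assessment parameters suggest a simple selection rule: prefer the highest accuracy and, among equally accurate classifiers, the one requiring the fewest positions. The sketch below applies this rule; the ordering of the two criteria is an assumption derived from the paragraph above, and the candidate names and values are hypothetical.

```python
from collections import namedtuple

# accuracy is given between 0 and 1; positions is the complexity.
Candidate = namedtuple("Candidate", ["name", "accuracy", "positions"])


def select_classifier(candidates):
    """Pick the most accurate candidate; ties go to the lowest complexity."""
    if not candidates:
        raise ValueError("no candidates given")
    return min(candidates, key=lambda c: (-c.accuracy, c.positions))


# Hypothetical assessment results for three classifiers.
candidates = [
    Candidate("classifier_a", 1.0, 237),
    Candidate("classifier_b", 1.0, 10),
    Candidate("classifier_c", 0.8, 5),
]

best = select_classifier(candidates)
print(best.name)
```

In this example, the third candidate is not chosen despite its low complexity, because accuracy takes precedence.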
The first question was whether a patient was suffering in general from a malignant lung tumor. For this purpose, both the kNN algorithm and the RT algorithm provided an accuracy of 100%. For classification, the RT algorithm required 237 differentially methylated positions present in the panel. The kNN algorithm, on the other hand, required only 10 positions, which qualified it as optimal for this problem (
The question regarding entity could be answered by all three algorithms with an accuracy of 100%. For the calculations, kNN required 22 positions, SVM 22 positions and RT 10 positions. Therefore, the RT algorithm was best-suited for this question (
For the last question of tumor stage, it was most difficult to choose a suitable classifier. Using 523 positions, the SVM algorithm managed to distinguish the late tumor stages with 80% accuracy (
All positions and classification parameters are described in detail in the annex (see Tab. 2-4). The described results therefore render it possible to carry out a diagnosis of lung cancer from a liquid biopsy of a patient by means of sequencing of purified, bisulfite-converted DNA enriched via oligonucleotides which hybridize to the methylation markers. In this case, the sequencing data are preferably aligned against a reference genome using the Segemehl algorithm and then evaluated on the basis of the correlation of the methylation, optionally on the basis of the classification as described above.
Chromosomes M, X and Y were discarded; the commands were:
For this, CpG loci which lay within a cluster consisting of at least two further differentially methylated CpG loci were selected. All CpG loci of the cluster were either hypomethylated or hypermethylated. The distance between the CpG loci was 2 to 20 nt.
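The cluster criterion can be sketched as follows. The sketch assumes that the stated distance of 2 to 20 nt applies to neighbouring CpG loci on the same chromosome and that a cluster therefore comprises at least three loci with the same direction of change; the input format and all values are hypothetical.

```python
def find_clusters(loci, min_gap=2, max_gap=20, min_size=3):
    """Group differentially methylated CpG loci into clusters.

    loci: list of (chrom, pos, diff) tuples, where diff > 0 means
    hypermethylated and diff < 0 means hypomethylated.
    """
    clusters = []
    current = []
    for locus in sorted(loci, key=lambda item: (item[0], item[1])):
        if current:
            prev = current[-1]
            same_chrom = prev[0] == locus[0]
            gap_ok = min_gap <= locus[1] - prev[1] <= max_gap
            same_direction = (prev[2] > 0) == (locus[2] > 0)
            if not (same_chrom and gap_ok and same_direction):
                # Close the current run and keep it only if large enough.
                if len(current) >= min_size:
                    clusters.append(current)
                current = []
        current.append(locus)
    if len(current) >= min_size:
        clusters.append(current)
    return clusters


# Hypothetical differentially methylated loci.
loci = [
    ("chr1", 100, 0.30), ("chr1", 110, 0.20), ("chr1", 125, 0.40),
    ("chr1", 300, 0.30), ("chr1", 310, -0.20),
    ("chr2", 50, -0.30), ("chr2", 55, -0.25), ("chr2", 400, -0.20),
]

clusters = find_clusters(loci)
cluster_positions = [[pos for _, pos, _ in cluster] for cluster in clusters]
print(cluster_positions)
```

Only the first three loci on chr1 form a cluster here; the remaining loci are either too far apart, change in opposite directions, or form a run of fewer than three loci.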
Table 1: Set of methylation markers (plasma panel; 630 differentially methylated regions). The column “Tumor” indicates whether an increased (hypermethylated) or reduced (hypomethylated) methylation was identified in tumor tissue. A. 350 regions which detect a malignant lung tumor. B. 247 regions which distinguish the most common lung carcinoma entities (adenocarcinoma and squamous cell carcinoma) from one another. C. 33 prognostically relevant CpG loci. Method: cfDNA (WGBS): cfDNA; surgical preparations (HM 450K): surgical; the bivalent chromatin study: bChrSt.
Number | Date | Country | Kind |
---|---|---|---|
19195688.7 | Sep 2019 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2020/074775 | 9/4/2020 | WO |