The present invention belongs to the field of biotechnology, and more specifically, relates to a method for disease prediction by using cell-free DNA.
Tumor prediction is an important problem in the prior art, and many methods can currently be applied to it. Tumor prediction can be conducted based on serological tumor markers: many serum proteins, such as CA125, CA19-9, CEA and HGF, play a certain role in the diagnosis and detection of tumors [1, 2]. CT, nuclear magnetic resonance and other imaging means are also used for tumor prediction. Gene-based prediction may rely on next-generation sequencing technology in the following ways. A) Tumor prediction may be based on genomic variation at the SNV level. Recent studies on cfDNA show that tumor-specific mutations can be used for early screening of tumors, in which tumor-specific somatic mutations are detected by high-depth targeted sequencing, multiplex PCR, etc. [3, 4]. B) Tumor prediction may be based on CNV. Variation at the chromosome level or copy number variation can be detected by cfDNA whole genome sequencing [5-7]. C) Tumor prediction may be based on chromosomal methylation. Recent studies show that methylation biomarkers can be used for tumor prediction [8, 9]. D) Tumor prediction may be based on the specific nucleosome-associated blotting of tumor cfDNA fragments. cfDNA sequencing can reflect the length of the nucleosome-wrapped cfDNA fragments. The study by Jiang P et al. [7] pointed out that, when detecting tumor fragments in the cfDNA of patients with liver cancer, part of the cfDNA fragments of these patients were shorter than those of normal individuals. Cristiano S et al. take the proportion of short cfDNA fragments in each interval of the whole genome as a feature, which can be used to predict tumors and identify their tissue types. The positions of nucleosomes and the positions of cfDNA fragment ends on the genome [12, 13] show a certain correlation with the tumor and its tissue source.
The above techniques are usually used in combination in existing tumor detection products and published tumor prediction research. For example, LUNAR-2 (https://guardanthealth.com/solutions/#lunar-2) of Guardant Health combines techniques A), C) and D) above and can reach a higher sensitivity in colorectal cancer detection; however, the specific method is unknown. Signatera (https://www.natera.com/signatera), a postoperative tumor detection product of Natera, is based on A) above and selects 16 specific SNV loci, reaching an ultrahigh sensitivity in the recurrence detection of colorectal cancer and lung cancer [14, 15]. Joshua D. Cohen's team published a study in Science in 2018: CancerSEEK, a tumor detection method based on serum markers and SNV, showed a specificity of up to 99% and a sensitivity of 69% to 98%, depending on cancer type, when used in 1005 patients with 8 different types of tumors including lung cancer, liver cancer, colorectal cancer, etc. [16].
Tumor prediction in the prior art has several main shortcomings. Serological tumor markers are usually also present in the serum of normal individuals, which leads to lower precision and specificity in detection, so they are difficult to apply in the early screening of tumors. CT, nuclear magnetic resonance and other imaging means carry a higher risk of false positives and false negatives in the early screening of tumors, so early screening is difficult to realize with them. Gene detection based on next-generation sequencing technology may have the following problems. For detection based on genomic variation at the SNV level, the specific variation is not detectable in all patients, and large-scale popularization is difficult due to the high experimental cost. For detection based on CNV, only a small number of individuals have this type of variation. For detection based on genomic methylation, large-scale application and popularization are difficult due to the higher cost. Detection based on the specific nucleosome-associated blotting of tumor cfDNA fragments usually requires a higher sequencing depth, is still at the stage of scientific research, and is difficult to apply in routine clinical detection. In summary, there is no effective method for predicting early tumors in the prior art.
In view of the current situation that there is no effective disease diagnosis method in clinical practice, the present invention attempts to provide a disease prediction model with a relatively high accuracy and its construction method and application.
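Therefore, in a first aspect, the present invention provides a method for constructing a cell-free DNA-based disease prediction model, comprising: 1) obtaining sequencing data of cell-free DNA samples of a plurality of diseased individuals and a plurality of control individuals; 2) selecting a gene set having differences in the coverage at the transcription start site regions between the diseased individuals and the control individuals according to the coverage of the sequencing data of the cell-free DNA samples of the diseased individuals and the control individuals on the genome; and 3) for the genes in the gene set, training a prediction model by inputting the coverage of the sequencing data at the gene transcription start site regions to construct a disease prediction model.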
In one embodiment, the disease is cancer, and preferably, the cancer is lung cancer, liver cancer or colorectal cancer.
In one embodiment, the disease prediction includes early screening of tumors or detection of tumor recurrence.
In one embodiment, in 1), the cell-free DNA samples are derived from body fluids, such as blood.
In one embodiment, in 2), the coverage of the cell-free DNA on the genome is determined by the relative coverage.
In one embodiment, in 2), the transcription start site region refers to a region of 100 bp, 400 bp, 600 bp, or 1 kb upstream and downstream of the transcription start site.
In one embodiment, in 2), the genes having differences in the coverage at the transcription start site regions between the diseased individuals and the control individuals are sorted by the degree of difference, and the genes with the most significant differences are selected.
In one embodiment, in 2), the gene set comprises 10-50 genes.
In one embodiment, in 3), the prediction model is a Logistic Regression model or a Random Forest model.
In a second aspect, the present invention provides a disease prediction model constructed according to the method of the first aspect of the present invention.
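In a third aspect, the present invention provides a cell-free DNA-based disease prediction method, which uses the disease prediction model constructed according to the method of the first aspect of the present invention, comprising: 1) for the cell-free DNA sample of the individual to be tested, obtaining sequencing data of the gene set determined in constructing the disease prediction model; 2) for the genes in the gene set, obtaining the coverage of the sequencing data at the transcription start site regions; and 3) inputting the coverage at the transcription start site regions into the disease prediction model to predict whether the individual to be tested suffers from the disease.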
In a fourth aspect, the present invention provides a cell-free DNA-based disease prediction system, comprising:
In one embodiment, the disease is cancer, and preferably, the cancer is lung cancer, liver cancer or colorectal cancer.
In one embodiment, the disease prediction includes early screening of tumors or detection of tumor recurrence.
In one embodiment, in the sequence acquisition unit, the cell-free DNA samples are derived from body fluids, such as blood.
In one embodiment, in the gene set selection unit, the coverage of the cell-free DNA on the genome is determined by the relative coverage.
In one embodiment, in the gene set selection unit, the transcription start site region refers to a region of 100 bp, 400 bp, 600 bp, or 1 kb upstream and downstream of the transcription start site.
In one embodiment, in the gene set selection unit, the genes having differences in the coverage at the transcription start site regions between the diseased individuals and the control individuals are sorted by the degree of difference, and the genes with the most significant differences are selected.
In one embodiment, in the gene set selection unit, the gene set comprises 10-50 genes.
In one embodiment, in the model constructing unit, the prediction model is a Logistic Regression model or a Random Forest model.
The present invention realizes rapid, efficient and low-cost early prediction of diseases such as lung cancer by using only the sequencing depth distribution information of cfDNA from a single sampling, without any other auxiliary means or additional data.
The peripheral blood of tumor patients contains circulating tumor DNA (ctDNA) derived from the tumor. ctDNA accounts for only a small part of all circulating cell-free DNA (cfDNA) in the peripheral blood. The present invention utilizes the changes in the coverage depth of cfDNA sequencing reads at the transcription start site (TSS), transcription termination site (TTS) or nucleosome depletion region (NDR) to predict the disease. Furthermore, the present invention constructs a prediction model based on the coverage of the nucleosome interval.
The present invention provides a disease prediction model with a relatively high accuracy and its construction method and application. The method for constructing a cell-free DNA-based disease prediction model comprises: 1) obtaining sequencing data of cell-free DNA samples of a plurality of diseased individuals and a plurality of control individuals; 2) selecting a gene set having differences in the coverage at the transcription start site regions between the diseased individuals and the control individuals according to the coverage of the sequencing data of the cell-free DNA samples of the diseased individuals and the control individuals on the genome; and 3) for the genes in the gene set, training a prediction model by inputting the coverage of the sequencing data at the gene transcription start site regions to construct a disease prediction model. The cell-free DNA-based disease prediction method comprises: 1) for the cell-free DNA sample of the individual to be tested, obtaining sequencing data of the gene set determined in constructing the disease prediction model; 2) for the genes in the gene set, obtaining the coverage of the sequencing data at the transcription start site regions; and 3) inputting the coverage at the transcription start site regions into the disease prediction model to predict whether the individual to be tested suffers from the disease. In the above two methods, the gene set used corresponds to the method for calculating the coverage of the sequencing data at the transcription start site regions.
The application of the disease prediction model includes the cell-free DNA-based disease prediction. The present invention provides a cell-free DNA-based disease prediction system, which can be used to implement the cell-free DNA-based disease prediction.
According to a specific example of the present invention, plasma cfDNA sequencing data of normal controls and patients with early lung cancer are used as input data, and the specific steps are as follows:
After quality control of all raw sequencing data (FASTQ format) of the samples used for model training, prediction and validation, the reads are aligned to the human reference chromosomes using alignment software (such as the samse mode of BWA). SAMtools is used to calculate the duplication rate, alignment rate and mismatch rate of the alignment results, and the reads aligned to the human reference chromosomes are selected.
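The alignment step above can be sketched as a small helper that only assembles the command lines. This is an illustration under assumptions: the file names and the `alignment_commands` helper are hypothetical, single-end `bwa aln` and `bwa samse` are assumed as suggested by the text, and `samtools view -F 4`, `samtools flagstat` and `samtools stats` are one plausible way to keep aligned reads and collect alignment statistics. The source does not state the exact invocations.

```python
def alignment_commands(reference, fastq, prefix):
    """Return (argv, stdout_file) pairs for a single-end BWA + SAMtools run.

    Hypothetical helper: paths and output names are illustrative only.
    A stdout_file of None means the command writes its own output file.
    """
    sai = f"{prefix}.sai"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.aligned.bam"
    return [
        # Locate each read in the index (bwa aln writes its result to stdout).
        (["bwa", "aln", reference, fastq], sai),
        # Turn those coordinates into single-end alignments in SAM format.
        (["bwa", "samse", reference, sai, fastq], sam),
        # Keep only aligned reads: -F 4 drops reads flagged as unmapped.
        (["samtools", "view", "-b", "-F", "4", "-o", bam, sam], None),
        # Summary counts (e.g. mapped reads) and detailed stats (e.g. error rate).
        (["samtools", "flagstat", bam], f"{prefix}.flagstat.txt"),
        (["samtools", "stats", bam], f"{prefix}.stats.txt"),
    ]


if __name__ == "__main__":
    for argv, out in alignment_commands("reference.fa", "sample.fq", "sample"):
        print(" ".join(argv) + (f" > {out}" if out else ""))
```

Returning argument lists rather than running the tools keeps the sketch independent of any installed software; each list could be passed to `subprocess.run` with the indicated file opened as standard output.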
For each sample, the sequencing depth near the transcription start site (TSS) region is calculated for each gene in the whole genome (the region of 100 bp, 400 bp, 600 bp, or 1 kb upstream and downstream of the transcription start site can each be used as the region near the transcription start site). Different computational methods are used for single-strand and double-strand sequencing. For single-strand sequencing there are two cases, forward alignment and reverse alignment. In the forward alignment, the start site of the alignment in the bam file is recorded directly; in the reverse alignment, the end site of the alignment in the bam file is recorded as the start site. Then, depending on the direction of alignment, the read is extended from the recorded start site (toward higher coordinates for a forward alignment and toward lower coordinates for a reverse alignment) to 167 bp, the peak length of cfDNA. For double-strand sequencing, only fragments whose read 1 and read 2 are aligned to the same chromosome and whose insert length is 120 bp to 300 bp are counted.
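The fragment reconstruction described above can be illustrated with a minimal sketch. The function names, the 0-based half-open coordinate convention and the clamping at position 0 are assumptions made for the example; only the 167 bp peak length and the 120 bp to 300 bp insert range come from the text.

```python
PEAK_LENGTH = 167  # peak cfDNA fragment length given in the text


def infer_fragment(aln_start, aln_end, is_reverse, length=PEAK_LENGTH):
    """Infer a fragment interval [start, end) from one single-end alignment.

    Coordinates are assumed to be 0-based and half-open.
    """
    if is_reverse:
        # Reverse alignment: the alignment end is taken as the start site,
        # so the fragment extends toward lower coordinates.
        return (max(0, aln_end - length), aln_end)
    # Forward alignment: extend from the alignment start toward higher coordinates.
    return (aln_start, aln_start + length)


def keep_pair(chrom1, chrom2, insert_size, min_len=120, max_len=300):
    """Keep a read pair only if both mates map to the same chromosome and
    the insert length lies within [min_len, max_len]."""
    return chrom1 == chrom2 and min_len <= abs(insert_size) <= max_len


if __name__ == "__main__":
    print(infer_fragment(1000, 1075, is_reverse=False))
    print(infer_fragment(1000, 1075, is_reverse=True))
    print(keep_pair("chr1", "chr1", 167))
```

With a real BAM reader, the alignment start, alignment end, strand flag and insert size would be taken from each record; those accessors are deliberately left out here.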
The average sequencing depth near the transcription start site region of each gene is calculated after locating the distribution position of fragments on the genome according to the alignment file. In order to enhance the relevant signals, only the sequencing depth of the central 61 bp of the sequencing fragment is counted, and normalization is carried out according to the overall aligned read count, to remove the differences caused by different aligned read counts and obtain the relative coverage (RC).
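A minimal sketch of the relative coverage calculation follows. The text specifies the central 61 bp of each fragment and normalization by the overall aligned read count, but not the exact formula, so the per-million scaling, the window definition `[tss - flank, tss + flank)` and the function names are assumptions.

```python
def central_core(frag_start, frag_end, width=61):
    """Return the central `width` bp of a fragment as a half-open interval."""
    center = (frag_start + frag_end) // 2
    half = width // 2
    return (center - half, center + half + 1)


def relative_coverage(fragments, tss, flank, total_aligned_reads, scale=1_000_000):
    """Mean depth of fragment cores over the TSS window, per `scale` aligned reads.

    fragments: iterable of (start, end) intervals on the gene's chromosome.
    The window is [tss - flank, tss + flank); the scaling is an assumption.
    """
    win_start, win_end = tss - flank, tss + flank
    covered_bases = 0
    for frag_start, frag_end in fragments:
        core_start, core_end = central_core(frag_start, frag_end)
        # Count the bases of this core that fall inside the window.
        covered_bases += max(0, min(core_end, win_end) - max(core_start, win_start))
    mean_depth = covered_bases / (win_end - win_start)
    return mean_depth / total_aligned_reads * scale


if __name__ == "__main__":
    frags = [(1000, 1167), (5000, 5167)]
    print(relative_coverage(frags, tss=1083, flank=100, total_aligned_reads=1_000_000))
```

Counting only the fragment core concentrates the signal on the likely nucleosome center, and dividing by the total aligned read count makes samples with different sequencing yields comparable.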
For the region near the transcription start site of each gene (or transcript), the relative coverage values of the lung cancer samples and the control samples at this region are tested for a significant difference (general statistical testing methods such as the rank sum test or the t test can be used), and the m most significantly different genes (m is 10-50, set to an appropriate value according to the number of training samples) are selected as lung cancer-related genes for the subsequent construction of the prediction model.
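The gene selection above can be sketched with a plain rank sum test. The implementation below uses the normal approximation without tie correction, which is a simplification chosen to keep the example dependency-free; the gene names and values are made up, and `rank_sum_p` and `select_genes` are hypothetical helper names.

```python
import math


def rank_sum_p(x, y):
    """Two-sided rank sum (Mann-Whitney U) p-value, normal approximation.

    Ties receive average ranks; no tie correction is applied to the variance.
    """
    pooled = sorted((value, group) for group, values in enumerate((x, y)) for value in values)
    n1, n2 = len(x), len(y)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


def select_genes(case_rc, control_rc, m):
    """Return the m genes with the smallest p-values.

    case_rc / control_rc map gene name -> list of relative coverage values.
    """
    p_values = {gene: rank_sum_p(case_rc[gene], control_rc[gene]) for gene in case_rc}
    return sorted(p_values, key=p_values.get)[:m]


if __name__ == "__main__":
    cases = {"GENE_A": [1, 2, 3, 4, 5], "GENE_B": [1, 3, 5, 7, 9]}
    controls = {"GENE_A": [6, 7, 8, 9, 10], "GENE_B": [2, 4, 6, 8, 10]}
    print(select_genes(cases, controls, m=1))
```

In practice a statistics package with exact or tie-corrected p-values would normally replace `rank_sum_p`; only the ranking of genes by p-value matters for the selection.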
A prediction model is constructed by inputting the lung cancer-related gene matrix formed by the relative coverage at the transcription start site regions of the significantly different genes obtained in Step 3 corresponding to n samples used for model training. That is, the relative coverage at the region of 100 bp, 400 bp, 600 bp or 1 kb upstream and downstream of the transcription start sites of m significantly different genes corresponding to n samples is calculated to obtain the relative coverage matrix of n×m, which is used as training set D.
Statistical software such as R can be used to train a Logistic Regression, Random Forest or other prediction model, and the final result is stored as the prediction model used in the last step.
In one embodiment, the present invention uses a model based on Random Forest (default parameters).
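As an illustration of training on the n×m relative coverage matrix D, the sketch below fits a logistic regression by plain gradient descent. This is a stand-in chosen so the example needs no external packages; it is not the Random Forest of the embodiment, and the toy matrix, learning rate and function names are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(D, labels, lr=0.1, epochs=1000):
    """Fit weights and bias on an n x m matrix D with 0/1 labels."""
    n, m = len(D), len(D[0])
    weights, bias = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * m, 0.0
        for row, y in zip(D, labels):
            error = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) - y
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        # Average the gradient over the n samples before stepping.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(model, row):
    """Probability that a sample (one row of relative coverage values) is diseased."""
    weights, bias = model
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


if __name__ == "__main__":
    # Toy training set: 6 samples x 2 genes (made-up relative coverage values).
    D = [[0.2, 1.1], [0.3, 0.9], [0.25, 1.0], [0.9, 0.3], [1.0, 0.2], [0.8, 0.4]]
    labels = [1, 1, 1, 0, 0, 0]  # 1 = lung cancer, 0 = control
    model = train_logistic(D, labels)
    print([round(predict_proba(model, row), 3) for row in D])
```

The returned `(weights, bias)` pair plays the role of the stored prediction model; a Random Forest from a standard library would be trained on the same matrix and labels.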
For the sample set to be predicted, the relative coverage values at the transcription start site regions of the significantly different genes selected above are calculated for each sample. The m relative coverage values of each sample are taken as input, and the prediction model obtained in Step 4 is used to predict whether the sample is a tumor sample.
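The prediction step can be sketched as follows, with emphasis on feeding the m values in the same gene order used during training. The linear model form, the gene names, the weights and the 0.5 threshold are all hypothetical; a stored Random Forest would be applied to the same ordered feature vector.

```python
import math


def predict_sample(sample_rc, model_genes, weights, bias, threshold=0.5):
    """Score one sample with a stored linear model.

    sample_rc: dict mapping gene name -> relative coverage at its TSS region.
    model_genes: gene order used when the model was trained.
    Returns (probability, is_tumor).
    """
    missing = [gene for gene in model_genes if gene not in sample_rc]
    if missing:
        raise KeyError(f"relative coverage missing for genes: {missing}")
    # Build the feature vector in training order, regardless of dict order.
    z = bias + sum(w * sample_rc[gene] for gene, w in zip(model_genes, weights))
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability, probability >= threshold


if __name__ == "__main__":
    genes = ["GENE_A", "GENE_B"]      # hypothetical selected genes
    weights, bias = [-3.0, 3.0], 0.0  # hypothetical trained parameters
    print(predict_sample({"GENE_B": 1.0, "GENE_A": 0.2}, genes, weights, bias))
```

Raising an error when a gene is missing avoids silently scoring a sample whose coverage was computed for a different gene set or window size than the one used in training.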
Sampling and sequencing: plasma samples of healthy individuals and patients with lung cancer were taken to extract cell-free DNA. After the experimental library was constructed, sequencing was performed on a BGISEQ-500 using a PE100, 3× sequencing protocol.
The present invention realizes lung cancer prediction with relatively high accuracy using only the genome-wide sequencing depth distribution of plasma cfDNA data obtained from a single sampling, providing a concise, efficient and low-cost auxiliary reference for the clinical diagnosis of lung cancer. The present invention integrates the coverage at the transcription start site regions of different genes into a Random Forest model to realize efficient early prediction of lung cancer with relatively high accuracy, and provides a comprehensive and systematic method for predicting lung cancer from cfDNA data.
The data are derived from www.ebi.ac.uk (accession no. EGAS00001001024) and were sequenced on the Illumina platform with paired-end reads of 75 bp; the read count per sample was 17-79 million, with a median of 31 million. Please refer to Peiyong Jiang, et al. PNAS 2015 for a detailed description of the data.
90 cell-free nucleic acid samples of liver cancer and 32 cell-free nucleic acid samples of healthy controls were included. The data were divided into a training set of 97 samples and a test set of 25 samples at a ratio of 8:2, with the ratio of liver cancer samples to healthy samples kept constant.
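The split above can be reproduced with a simple stratified split. The sketch assumes that the training share is rounded down within each group, which yields 72 + 25 = 97 training samples and 18 + 7 = 25 test samples for 90 cases and 32 controls; the sample identifiers, the seed and the `stratified_split` helper are hypothetical.

```python
import random


def stratified_split(cases, controls, train_tenths=8, seed=0):
    """Split each group separately so the case/control ratio is preserved.

    train_tenths=8 corresponds to an 8:2 split; the training count in each
    group is rounded down (an assumption, not stated in the text).
    """
    rng = random.Random(seed)
    train, test = [], []
    for label, group in (("liver_cancer", cases), ("healthy", controls)):
        shuffled = list(group)
        rng.shuffle(shuffled)
        n_train = len(shuffled) * train_tenths // 10
        train += [(sample, label) for sample in shuffled[:n_train]]
        test += [(sample, label) for sample in shuffled[n_train:]]
    return train, test


if __name__ == "__main__":
    cases = [f"HCC_{i:02d}" for i in range(90)]
    controls = [f"CTRL_{i:02d}" for i in range(32)]
    train, test = stratified_split(cases, controls)
    print(len(train), len(test))
```

Splitting within each group, rather than over the pooled samples, is what keeps the proportion of liver cancer to healthy samples approximately equal in the training and test sets.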
The three-step process, comprising preliminary data processing, calculation of the relative coverage at the transcription start site region in each single sample, and selection of liver cancer-related genes, was the same as described above. After the Wilcoxon rank sum test was performed on the relative coverage near the transcription start sites between the two groups, the genes were ranked by P value from small to large in the training set and 25 differential genes were screened as features. A Random Forest model was built on the training data set and then applied to the test data set. The results are shown as follows:
The ROC curve of the test set is shown in
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2021/071822 | 1/14/2021 | WO |