The invention pertains to the field of genomics and bioinformatics, and relates to a cfDNA classification method, apparatus and application.
Urogenital system tumors (prostate cancer, urothelial cancer and renal cancer) are serious diseases that endanger human health. The diagnosis and monitoring methods for urogenital system tumors are usually invasive, or lack sensitivity and specificity.
Renal cancer accounts for about 3% of adult malignant tumors and 90% to 95% of kidney tumors, of which about 75% are renal clear cell carcinomas. Surgery remains the most effective treatment for localized renal cancer, but about 20% to 40% of patients relapse after surgery. Renal cell carcinoma has low sensitivity to radiotherapy and chemotherapy. The mortality rate of renal cancer patients is as high as 40%, mainly because obvious clinical symptoms are lacking in the early stage and effective treatments are lacking in the advanced stage. At present, imaging, fine needle aspiration (FNA), and core biopsy (CB) can only assist in monitoring and cannot give a clear diagnosis, and there is no tumor marker with good sensitivity and specificity that can be used for early diagnosis and postoperative follow-up of renal cancer.
Urothelial carcinoma is a malignant tumor arising from the transitional epithelium that lines the renal pelvis, ureter, bladder, urethra, etc. It mainly includes upper urothelial cancer, which occurs in the renal pelvis and ureter, and bladder cancer. Upper urothelial cancer is relatively rare, accounting for only 5% to 10% of urothelial cancers, but in China it accounts for as much as 30% of urothelial cancers. A number of studies have shown that this regional characteristic of upper urothelial cancer may be related to the use of traditional Chinese medicine containing aristolochic acid and its analogues. Although the tissue sources are the same, upper urothelial cancer and bladder cancer have very different clinicopathological characteristics, so screening of new risk factors, new targets, and new markers for the diagnosis, prognosis and dynamic monitoring of urothelial cancer must consider both subtypes at the same time. In addition, the high recurrence rate of urothelial cancer may lead to more operations, a higher incidence of complications, and higher treatment costs. Patients with recurrence eventually need to undergo radical cystectomy or bilateral nephroureterectomy, which greatly reduces the survival rate and quality of life. At present, bladder cancer can be diagnosed with the aid of imaging, fluorescence in situ hybridization (FISH), and urine cytology, but the sensitivity for low-grade bladder tumors is only 4% to 31%. The most important method for diagnosing bladder cancer is cystoscopy, but cystoscopy is expensive and invasive, which increases the patient's pain. Moreover, because the recurrence rate of bladder cancer is high, cystoscopy is inconvenient for long-term, lifelong and prognostic monitoring.
Prostate cancer is a common malignant tumor in men, and its incidence is on the rise. There are no symptoms in the early stage of prostate cancer. When the tumor develops to a certain extent, it blocks the urethra or invades the bladder neck, causing frequent urination, urinary urgency, and urinary incontinence. Many patients are already in the advanced stage when a definite diagnosis is made, and many patients in the advanced stage have bone metastases. At present, the accepted diagnostic methods for prostate cancer are digital rectal examination and prostate-specific antigen (PSA) examination, but the level of PSA can also be affected by factors such as prostatitis, urinary retention, catheterization and drugs, resulting in a high false-positive rate.
With the development of science and technology, the diagnosis technology for tumors is also constantly advancing. In June 2017, the World Economic Forum and the Expert Committee of Scientific American jointly selected the 2017 list of the global top ten emerging technologies, in which non-invasive diagnostic technology for tumors ranked first. The emergence of non-invasive tumor diagnostic technology, i.e., liquid biopsy, marks another big step forward on the road to conquering tumors. Compared with traditional tissue biopsy, liquid biopsy has unique advantages such as real-time dynamic detection, overcoming tumor heterogeneity, and providing comprehensive detection information. At present, in clinical research, liquid biopsy mainly includes detection of circulating tumor cells (CTCs), circulating tumor DNA (ctDNA), exosomes and circulating RNA, etc. Compared with traditional diagnostic technology relying on clinical symptoms or imaging, liquid biopsy can detect disease progression earlier. Liquid biopsy is expected to play a major role in evaluating tumor dynamics and load changes during patient treatment, monitoring the effectiveness of treatment in real time, and monitoring minimal residual lesions, recurrence, prognosis, and drug resistance in patients.
At present, there is still a need to develop new detection methods for urogenital system tumors, which have better specificity and sensitivity, are more convenient for multiple, long-term and prognostic monitoring, and reduce patient suffering.
After in-depth research and creative work, the present inventors surprisingly found that the detection of cell-free DNA (cfDNA) in urine supernatant is beneficial to the detection or diagnosis of early-stage, low-grade, non-invasive tumors of the urinary system. Furthermore, the present inventors designed and completed experiments, sequencing and analysis, and found that by detecting the cfDNA copy number variation (CNV) in the urine supernatant, the diagnosis and classification of up to 3 urogenital system tumors can be completed at one time. The following invention is therefore provided:
One aspect of the present invention relates to a cfDNA classification method, comprising:
calculating copy number variation data of cfDNA in a target sample;
calculating a similarity degree between the target cfDNA copy number variation data and the cfDNA copy number variation data of each category label; and
determining the category to which the target cfDNA belongs by using a classifier model according to the similarity degree.
In some embodiments of the present invention, in the classification method, to determine the category to which the target cfDNA belongs comprises:
according to the similarity degree, using a random forest model to determine the correlation degree between the cfDNA copy number variation data of each category label and a human urogenital system tumor;
according to the correlation degree, using the classifier model to determine the category to which the target cfDNA belongs.
In some embodiments of the present invention, in the classification method, to determine the correlation degree between the cfDNA copy number variation data of each category label and the human urogenital system tumor comprises:
according to the correlation degree, sorting the cfDNA copy number variation data to form a vector sequence;
inputting the vector sequence into the random forest model, and determining a correlation degree between the cfDNA copy number variation data of the category label and the human urogenital system tumor.
In some embodiments of the present invention, in the classification method, the human urogenital system tumor is one or more selected from the group consisting of prostate cancer, urothelial cancer and renal cancer;
preferably, the renal cancer is clear renal cell carcinoma,
preferably, the urothelial cancer is upper urothelial cancer and/or bladder cancer,
preferably, the prostate cancer is prostate adenocarcinoma;
preferably, the human urogenital system tumor is diagnosed by tissue biopsy of a surgical sample.
In some embodiments of the present invention, in the classification method, the random forest model is at least 3 random forest binary classifiers, and is one, two, three or four groups selected from the group consisting of the following Groups I to IV:
Group I.
normal-vs-renal cancer, normal-vs-urothelial cancer, normal-vs-prostate cancer;
Group II.
renal cancer-vs-normal, renal cancer-vs-urothelial cancer, renal cancer-vs-prostate cancer;
Group III.
urothelial cancer-vs-normal, urothelial cancer-vs-renal cancer, urothelial cancer-vs-prostate cancer;
Group IV.
prostate cancer-vs-normal, prostate cancer-vs-renal cancer, prostate cancer-vs-urothelial cancer.
In some embodiments of the present invention, in the classification method, a vote is taken for each group, and the category corresponding to the group with the highest number of votes is the final category; if several groups have the same number of votes, the category corresponding to the group with the highest prediction probability among those groups is the final category. The present inventors define this integrated classification method as GUdetector.
In some embodiments of the present invention, in the classification method, the copy number variation data of cfDNA in the target sample and/or the cfDNA copy number variation data of each category label is obtained by calculation from sequencing data of cfDNA in a urine sample; preferably, the sequencing data is whole-genome sequencing data; preferably, its sequencing depth is 1× to 5×.
In some embodiments of the present invention, in the classification method, the copy number variation data of cfDNA in the target sample and/or the cfDNA copy number variation data of each category label is calculated according to the following method:
dividing a genome of a sample to be tested into 5,000 to 500,000 bins (for example, 50,000 bins) with equal lengths or equal theoretical simulation copy numbers; normalizing the sequencing data, and calculating a ratio A/B of the number of reads corresponding to each bin,
wherein:
A represents the actual number of reads in a bin after GC content correction;
B represents the theoretical number of reads in the bin, and is obtained by dividing the total number of reads measured in the sample by the total number of bins;
the ratio A/B represents the copy number variation.
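The ratio defined above can be illustrated with a minimal sketch. It assumes the GC-corrected read count of every bin (A) and the total number of reads measured in the sample are already available; the function name and the four-bin toy sample are hypothetical and are not taken from the invention.

```python
def copy_number_ratios(corrected_reads_per_bin, total_reads):
    """Return the ratio A/B for every bin of one sample.

    corrected_reads_per_bin: GC-corrected read count A of each bin.
    total_reads: total number of reads measured in the sample.
    """
    n_bins = len(corrected_reads_per_bin)
    if n_bins == 0 or total_reads <= 0:
        raise ValueError("need at least one bin and a positive read total")
    # B: theoretical number of reads per bin, identical for all bins of a sample.
    expected_per_bin = total_reads / n_bins
    return [a / expected_per_bin for a in corrected_reads_per_bin]


# Toy sample with 4 bins and 400 reads, so B = 100 reads per bin.
ratios = copy_number_ratios([120, 100, 80, 100], total_reads=400)
```

In this toy sample the first bin has a ratio above 1 and the third bin a ratio below 1, while the other two bins sit at 1.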
In one or more embodiments of the present invention, in the classification method, the genome of the sample to be tested is divided into 5,000 to 500,000 bins with equal lengths or equal theoretical simulation copy numbers by a software or algorithm, such as Varbin, CNVnator, ReadDepth or SegSeq.
In one or more embodiments of the present invention, in the classification method, the ratio A/B of the number of reads corresponding to each bin is calculated by a software or algorithm, such as Varbin, CNVnator, ReadDepth, or SegSeq.
In one or more embodiments of the present invention, in the classification method, the genome of the sample to be tested is divided into 10,000 to 200,000 bins with equal lengths or equal theoretical simulation copy numbers.
In one or more embodiments of the present invention, in the classification method, the genome of the sample to be tested is divided into 10,000 to 150,000 bins with equal lengths or equal theoretical simulation copy numbers.
In one or more embodiments of the present invention, in the classification method, the genome of the sample to be tested is divided into 10,000 to 100,000 (for example, 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000 or 100,000) bins with equal lengths or equal theoretical simulation copy numbers.
In some embodiments of the present invention, in the classification method, the urine sample is a morning urine; preferably, the urine sample is a morning urine supernatant.
In some embodiments of the present invention, in the classification method, the ratio A/B is a ratio A/B of each biomarker in a biomarker combination,
wherein,
the biomarker combination is any one of the biomarker combinations of the present invention described below.
Another aspect of the present invention relates to a method for the detection, diagnosis, classification, disease risk assessment or prognosis assessment of a human urogenital system tumor, which comprises the following step (1), step (2), optionally step (3), and step (4):
(1) collecting a urine sample and extracting cfDNA;
(2) screening to obtain cfDNA fragments of 90 to 300 bp or cfDNA fragments of 100 to 300 bp,
(3) using the obtained cfDNA fragments to construct a whole-genome library; preferably, performing whole-genome sequencing on the whole-genome library; and
(4) classifying the cfDNA fragments by the classification method according to any one of items of the present invention. The cfDNA fragments are the cfDNA fragments obtained in step (2) or the cfDNA fragments in the whole genome library in step (3).
In some embodiments of the present invention, in the method, the human urogenital system tumor is one or more selected from the group consisting of prostate cancer, urothelial cancer and renal cancer;
preferably, the renal cancer is clear renal cell carcinoma,
preferably, the urothelial cancer is upper urothelial cancer and/or bladder cancer,
preferably, the prostate cancer is prostate adenocarcinoma.
In some embodiments of the present invention, in the method, in step (1), the urine sample is a morning urine; preferably, the urine sample is a morning urine supernatant.
In some embodiments of the present invention, in the method, in step (2), the screening is a magnetic bead screening.
Another aspect of the present invention relates to an apparatus for the detection, diagnosis, classification, disease risk assessment or prognosis assessment of a human urogenital system tumor, comprising:
I. ‘normal decision-making unit’:
normal-vs-renal cancer, normal-vs-urothelial cancer, normal-vs-prostate cancer;
II. ‘renal cancer decision-making unit’:
renal cancer-vs-normal, renal cancer-vs-urothelial cancer, renal cancer-vs-prostate cancer;
III. ‘urothelial cancer decision-making unit’:
urothelial cancer-vs-normal, urothelial cancer-vs-renal cancer, urothelial cancer-vs-prostate cancer; and
IV. ‘prostate cancer decision-making unit’:
prostate cancer-vs-normal, prostate cancer-vs-renal cancer, prostate cancer-vs-urothelial cancer.
Another aspect of the present invention relates to an apparatus for the detection, diagnosis, classification, disease risk assessment or prognosis assessment of a human urogenital system tumor,
comprising a memory; and a processor coupled to the memory,
wherein,
the memory stores a program instruction to be executed by a processor, and the program instruction comprises any one, any two, any three, or all of four decision-making units selected from the group consisting of the following four decision-making units, wherein each decision-making unit comprises 3 random forest binary classifiers:
I. ‘normal decision-making unit’:
normal-vs-renal cancer, normal-vs-urothelial cancer, normal-vs-prostate cancer;
II. ‘renal cancer decision-making unit’:
renal cancer-vs-normal, renal cancer-vs-urothelial cancer, renal cancer-vs-prostate cancer;
III. ‘urothelial cancer decision-making unit’:
urothelial cancer-vs-normal, urothelial cancer-vs-renal cancer, urothelial cancer-vs-prostate cancer;
IV. ‘prostate cancer decision-making unit’:
prostate cancer-vs-normal, prostate cancer-vs-renal cancer, prostate cancer-vs-urothelial cancer.
In some embodiments of the present invention, in the apparatus, the processor is configured to execute the classification method according to any one of items of the present invention based on the instruction stored in the memory device.
In some embodiments of the present invention, in the apparatus, the urogenital system tumor is one or more selected from the group consisting of prostate cancer, urothelial cancer and renal cancer;
preferably, the renal cancer is clear renal cell carcinoma,
preferably, the urothelial cancer is upper urothelial cancer and/or bladder cancer,
preferably, the prostate cancer is prostate adenocarcinoma.
Another aspect of the present invention relates to a use of any one selected from the group consisting of the following items 1) to 3) in the manufacture of a medicament for detection, diagnosis, disease risk assessment or prognosis assessment of a human urogenital system tumor:
1) the biomarker combination according to any one of items of the present invention;
2) a cfDNA in a human urine, especially a cfDNA in a human urine supernatant;
preferably, the urine is a morning urine;
preferably, the cfDNA is cfDNA of 90 to 300 bp, or cfDNA of 100 to 300 bp; more preferably, the cfDNA is cfDNA of 90 to 150 bp, or cfDNA of 100 to 150 bp;
3) a DNA library, which is prepared by item 2); preferably, the DNA library is a whole genome library;
preferably, the urogenital system tumor is one or more selected from the group consisting of prostate cancer, urothelial cancer and renal cancer;
preferably, the renal cancer is clear renal cell carcinoma,
preferably, the urothelial cancer is upper urothelial cancer and/or bladder cancer,
preferably, the prostate cancer is prostate adenocarcinoma.
Another aspect of the present invention relates to any one selected from the group consisting of the following items 1) to 3), which is used for the detection, diagnosis, disease risk assessment or prognosis assessment of a human urogenital system tumor:
1) the biomarker combination according to any one of items of the present invention;
2) a cfDNA in a human urine, especially a cfDNA in a human urine supernatant;
Preferably, the urine is a morning urine;
Preferably, the cfDNA is cfDNA of 90 to 300 bp, or cfDNA of 100 to 300 bp; more preferably, the cfDNA is cfDNA of 90 to 150 bp, or cfDNA of 100 to 150 bp;
3) a DNA library, which is prepared by item 2); preferably, the DNA library is a whole genome library;
preferably, the urogenital system tumor is one or more selected from the group consisting of prostate cancer, urothelial cancer and renal cancer;
preferably, the renal cancer is clear renal cell carcinoma,
preferably, the urothelial cancer is upper urothelial cancer and/or bladder cancer,
preferably, the prostate cancer is prostate adenocarcinoma.
Another aspect of the present invention relates to a biomarker combination, which comprises m biomarkers, and m represents a positive integer greater than or equal to 50;
the biomarker is a DNA fragment, correspondingly having an initiate site of A±n1, and a termination site of B±n2 on the chromosome;
wherein, the n1 and n2 are independently non-negative integers less than or equal to 60,000;
wherein, the chromosome, A and B are any one group, any two groups, any three groups, any four groups, any five groups, any six groups (for example, the first 6 groups) or all 7 groups selected from the group consisting of the following Groups (1) to (7);
(1) Biomarkers for Renal Cancer vs. Normal (the smaller the No. of a biomarker, the higher its classification effectiveness)
(2) Biomarkers for Urothelial Carcinoma vs. Normal (the smaller the No. of a biomarker, the higher its classification effectiveness)
(3) Biomarkers for Prostate Cancer vs. Normal (the smaller the No. of a biomarker, the higher its classification effectiveness)
(4) Biomarkers for Renal Cancer vs. Prostate Cancer (the smaller the No. of a biomarker, the higher its classification effectiveness)
(5) Biomarkers for Urothelial Cancer vs. Renal Cancer (the smaller the No. of a biomarker, the higher its classification effectiveness)
(6) Biomarkers for Urothelial Cancer vs. Prostate Cancer (the smaller the No. of a biomarker, the higher its classification effectiveness)
(7) Biomarkers for Normal vs. Prostate Cancer (considering gender differences, only males are included in the normal population; the smaller the No. of a biomarker, the higher its classification effectiveness)
In some embodiments of the present invention, in the biomarker combination, m is 50 to 300 or greater than 300, such as 50 to 100, 100 to 150, 150 to 200, 200 to 250, 250 to 300, 50, 100, 150, 200, 250, or 300.
In one or more embodiments of the present invention, in the biomarker combination, n1 and n2 are independently 5,000, 4,000, 3,000, 2,000, 1,500, 1,000, 500, 300, 200, 150, 100, 90, 80, 70, 60, 50, 40, 30, 20, 10, 5 or 0.
In one or more embodiments of the present invention, in the biomarker combination, the biomarker is a fragment of cfDNA; preferably, the cfDNA is derived from a human urine, especially a human urine supernatant.
In one or more embodiments of the present invention, in the biomarker combination:
the chromosome, A and B are shown in any 1 group, any 2 groups, any 3 groups, any 4 groups, any 5 groups, any 6 groups, or all 7 groups selected from the group consisting of the Groups (1) to (7).
Some terms involved in the present invention are explained as follows.
The term “bin” (interval/region) is a general term in the field of genomics for a segment obtained by artificially defining or dividing a genome according to a certain length. For example, when the about 3 billion base pairs of the human genome are equally divided into 3,000 bins, each bin has a size of about 1 million base pairs.
The term “cfNA” is the abbreviation of cell free nucleic acid, which refers to a free nucleic acid in plasma, which is an extracellular nucleic acid fragment in the peripheral circulation.
The term “cfDNA” is the abbreviation of cell free DNA, which refers to a free DNA in plasma, which is an extracellular DNA fragment in the peripheral circulation.
The term “coverage” refers to the proportion of the entire genome that has been detected at least once. Coverage measures the extent to which the genome is covered by sequencing data. Due to the existence of complex structures such as high-GC regions and repetitive sequences in the genome, the sequence obtained by final splicing and assembly often cannot cover the entire genome, and a region that is not obtained is called a gap. For example, if a bacterial genome is sequenced to a coverage of 98%, then 2% of the sequence region is not obtained through the sequencing.
The term “sequencing depth” refers to the ratio of the total number of bases (bp) obtained by sequencing to the size of the genome, and can be understood as the average number of times that each base in the genome is sequenced. For example, if a genome is 2M in size and the total amount of data obtained is 20M, then the sequencing depth is 20M/2M=10×.
The term “read” or “reads” refers to a sequenced fragment, that is, the measured sequence.
The term “pair-end reads” refers to paired reads.
The term “copy number variations (CNVs)” refers to the deletion or duplication of larger DNA fragments, i.e., an increase or decrease in the copy number of DNA fragments ranging from hundreds of bp to millions of bp. CNVs are caused by genome rearrangement and are one of the important pathogenic factors of tumors.
The term “theoretical simulation copy number” refers to the copy number calculated by a software and/or method in which the genome is divided into several regions with equal or unequal lengths such that, through data simulation, the theoretical copy number contained in each region is the same.
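The arithmetic behind the “bin” and “sequencing depth” definitions above can be checked with a short sketch. The helper names are hypothetical, and the genome size of 3 billion base pairs is the round figure used in the definition of “bin”, not an exact reference length.

```python
def sequencing_depth(total_bases_sequenced, genome_size):
    """Average number of times each base of the genome is sequenced."""
    return total_bases_sequenced / genome_size


def equal_length_bin_size(genome_size, n_bins):
    """Length of each bin when the genome is cut into n_bins equal parts."""
    return genome_size / n_bins


# Example from the definition of sequencing depth: 20M of data on a 2M genome.
depth = sequencing_depth(20_000_000, 2_000_000)

# Example from the definition of bin: about 3 billion bp split into 3,000 bins.
bin_3000 = equal_length_bin_size(3_000_000_000, 3_000)

# The same round genome size split into 50,000 bins.
bin_50000 = equal_length_bin_size(3_000_000_000, 50_000)
```

With these inputs the depth is 10×, each of 3,000 bins spans about 1 million base pairs, and each of 50,000 equal-length bins would span about 60,000 base pairs.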
The beneficial effects of the present invention
(1) Trace detection reduces the cost of sequencing, and the detection is achieved at a lower, shallower sequencing depth. The content of cfDNA released by early tumor cells is generally less than one percent or even one ten-thousandth. Therefore, detecting variations at the level of SNV (single nucleotide variation) and INDEL (insertion/deletion) in ctDNA is very challenging for current DNA detection technology and requires a very deep sequencing depth. In contrast, the present inventors use cfDNA whole-genome sequencing technology to detect the copy number variation, which is theoretically and technically feasible. The sample sequencing depth used by the present inventors is only 1× to 5×, and a highly sensitive and specific diagnosis is achieved.
(2) Highly accurate diagnosis of a single urinary system tumor is achieved.
(3) Tissue-specific diagnosis. The problem of determining which tumor is present under unknown circumstances is solved. Based on the biomarker groups selected by the established classification system, the present inventors can determine at one time, with high accuracy, which tumor of the urinary system the sample comes from.
(4) Truly non-invasive. Urine collection is simple and non-invasive, and causes no pain in patients, which is conducive to sample collection, diagnosis, and long-term and regular prognostic monitoring.
The embodiments of the present invention will be described in detail below in conjunction with examples, but those skilled in the art will understand that the following examples are only used to illustrate the present invention and should not be regarded as limiting the scope of the present invention. If specific conditions were not indicated in the examples, they would be carried out in accordance with the conventional conditions or the conditions recommended by the manufacturer. The reagents or instruments used without the manufacturer's indication were all conventional products that were purchased commercially.
Preparation of cfDNA Sample
95 healthy people;
172 patients, comprising: 58 patients with clear renal cell carcinoma (ccRCC), 69 patients with urothelial carcinoma and 45 patients with prostate cancer. All were diagnosed by tissue biopsy of surgical samples.
There were a total of 267 cases of healthy persons and patients.
(1) Morning urine of the above-mentioned healthy persons and preoperative morning urine of tumor patients were collected. About 20 to 50 ml of urine from each case was collected in a 50 ml tube. After collection, the urine was placed in an ice box, and extraction was performed within half an hour to avoid degradation of cfDNA.
(2) The collected morning urine samples were centrifuged at 3500 rpm for 15 minutes, and the supernatant of each sample was retained.
(3) The cfDNA was extracted using the Zymo Quick-DNA™ Urine Kit. The concentrations were measured with a Qubit 4 Fluorometer, and the samples were stored at −80° C.
267 cfDNA samples were prepared.
Samples: the 267 cfDNA samples obtained in Example 1 above.
Extraction kit for free urine DNA: ZYMO Quick-DNA Urine Kit (ZYMO, Cat #: D3061).
Magnetic beads: AMPure XP beads (Beckman Coulter, Cat #: A63880).
Regular centrifuge.
(1) cfDNA of 100 bp to 300 bp was screened by magnetic beads (the size range of the DNA fragments bound by the magnetic beads was controlled by the ratio of the volume of the magnetic beads to the volume of the cfDNA sample). The specific operations were as follows:
Magnetic beads at 0.6 times the volume of the extracted urine cfDNA sample were added; after binding for 5 minutes, the magnetic beads were discarded and the supernatant was retained. Magnetic beads at 0.3 times the volume were then added to the supernatant; after binding for 5 minutes, the supernatant was discarded and the magnetic beads were retained. (Note: the 0.6 volumes of magnetic beads bound the large DNA fragments, which were discarded; the 0.3 volumes of magnetic beads added to the supernatant bound the small fragments, i.e., the target DNA fragments, which were thus recovered.) The beads were washed twice with 80% ethanol, and finally the DNA was dissolved in water.
(2) End-repair and adding A. The specific operations were performed by referring to the instructions of kits, NEBNext End Repair Module: catalog number E6050S; NEBNext dA-Tailing Module, catalog number E6053S.
(3) Adding PE adaptor. The specific operations were performed by referring to the operating instructions of kit, T4 DNA Ligase, catalog number M0202L.
(4) An adaptor-specific primer was used for PCR amplification.
(5) The PCR product obtained above was purified with magnetic beads to obtain the DNA library, i.e., the whole genome library of each of the 267 samples.
In addition, an Agilent 2100 Bioanalyzer was used for quality control of the 267 libraries, and no adaptor contamination was found after library construction.
Samples to be tested: the libraries of the 267 cases prepared in Example 2 above.
Whole-genome sequencing was performed. The sequencing was commissioned to Novagene Sequencing Company.
50 bp pair-end reads from 267 libraries were obtained. The sequencing depth of each sample was approximately 1× to 5×. These were used for the following tumor marker analysis.
According to the Varbin algorithm (Genome-wide copy number analysis of single cells. Nature Protocols 7, 1024-1041, doi:10.1038/nprot.2012.039 (2012)), the genome of each sample was first divided into 50,000 bins. The number of reads and the GC content in each bin were then calculated from the sequencing results of Example 3 above, and normalized against the total number of reads and the GC content obtained by sequencing each library sample, so as to obtain the original number of reads and the actual number of reads (A) corrected by GC content in each bin of each sample. The correction method was locally weighted scatterplot smoothing (LOWESS smoothing). The ratio A/B of the number of reads in each bin to the theoretical number of reads in the bin was further obtained:
A represented the actual number of reads in a bin after GC content correction;
B represented the theoretical number of reads in the bin, which was obtained by dividing the total number of reads measured in the sample by the total number of bins (50,000). Therefore, for a sample, the theoretical number of reads in each of its bins was equal.
A ratio A/B greater than 1 indicated that the region was likely to have an increased copy number, a ratio equal to 1 indicated that the copy number of the region had not changed, and a ratio less than 1 indicated that the region was likely to have a decreased copy number.
In the end, each sample got 50,000 ratios, and these 50,000 ratios (also called features) were used for the subsequent screening of markers.
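The GC correction step can be illustrated with a simplified sketch. It is not the Varbin implementation: it assumes a plain local linear regression with tricube weights (LOWESS without robustness iterations), a smoothing fraction of 0.5, and rescaling to the mean raw count, all of which are illustrative choices. The quadratic-time loop is only suitable for small demonstrations; for 50,000 bins an optimized library routine would be the practical choice.

```python
import math


def lowess_fit(x, y, frac=0.5):
    """Fitted value at every x by local linear regression with tricube weights."""
    n = len(x)
    k = min(n, max(3, int(math.ceil(frac * n))))  # neighbours used per fit
    fitted = []
    for x0 in x:
        nearest = sorted((abs(xi - x0), i) for i, xi in enumerate(x))[:k]
        h = nearest[-1][0] or 1.0  # bandwidth: distance to the farthest neighbour
        sw = swx = swy = swxx = swxy = 0.0
        for d, i in nearest:
            w = (1.0 - (d / h) ** 3) ** 3  # tricube weight
            sw += w
            swx += w * x[i]
            swy += w * y[i]
            swxx += w * x[i] * x[i]
            swxy += w * x[i] * y[i]
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate case: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * x0)
    return fitted


def gc_correct(raw_reads_per_bin, gc_per_bin, frac=0.5):
    """Divide out the GC trend and rescale to the mean raw count."""
    trend = lowess_fit(gc_per_bin, raw_reads_per_bin, frac)
    mean_count = sum(raw_reads_per_bin) / len(raw_reads_per_bin)
    return [
        count * mean_count / t if t > 0 else count
        for count, t in zip(raw_reads_per_bin, trend)
    ]


# Hypothetical bins whose read count depends only on GC content.
gc = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60]
raw = [1000 * g for g in gc]
corrected = gc_correct(raw, gc)
```

Because the toy read counts depend on GC content alone, the corrected counts come out flat, which is the behaviour expected of a GC correction when no copy number change is present.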
For the 4 groups of object samples (healthy person samples, clear renal cell carcinoma patient samples, urothelial cancer patient samples, and prostate cancer patient samples), the object samples of each group were randomly divided into a training set (about 70%) and a test set (about 30%), so that 4 training sets and the corresponding 4 test sets were obtained respectively, and their respective numbers were shown in Table 8 below.
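The per-group random split described above can be sketched as follows. The sample identifiers, the fixed seed and the rounding rule are assumptions made for illustration; the actual set sizes are those reported in Table 8, not the sizes this sketch produces.

```python
import random


def split_group(sample_ids, train_fraction=0.7, seed=0):
    """Randomly split one group into a training set and a test set."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Group sizes from Example 1; the identifiers themselves are hypothetical.
group_sizes = {"normal": 95, "renal": 58, "urothelial": 69, "prostate": 45}
splits = {
    name: split_group([f"{name}_{i}" for i in range(size)])
    for name, size in group_sizes.items()
}
```

Splitting each group separately, rather than the pooled 267 cases, keeps roughly the same 70/30 proportion within every category.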
First, pairwise comparison was made among the 4 training sets. Specifically, each bin was compared pairwise between different groups, and the comparison was performed successively until all 50,000 bins were checked. That is, a t test was performed on the ratios A/B corresponding to the 50,000 bins, and when a ratio A/B with a significant difference (p<0.05) was found by the t test, the bin corresponding to that ratio was taken as a marker. For example, a bin was taken, the ratios A/B of that bin in the normal person group were compared with those in the renal cancer group, and the bin was retained when the statistical test showed a significant difference and discarded otherwise; this calculation was performed on all 50,000 bins. In this way, a total of 6 pairwise combinations and 6 groups of markers with significant differences were obtained.
These 6 groups of markers were then further screened as follows. The ratios A/B corresponding to each of the 6 groups of markers were input into a random forest classifier to train a binary classification model, and the markers were sorted by feature importance (that is, the output of the random forest algorithm; the more important a marker was for the classification, the higher it ranked). The top markers (such as top 500, top 300, top 100, top 50 and top 10) were selected to train the random forest model again, and the prediction accuracy rates on the training set and the test set were evaluated for the different marker sets. The markers with high accuracy rates were selected as the final marker set (when the accuracy rates were basically the same, the present inventors tended to choose the smaller marker combination). A total of 6 groups of markers were thus obtained from the 6 random forest binary classifiers, each group containing 50 markers, as shown in the previous Table 1 to Table 6.
The data corresponding to the 6 groups of biomarkers (markers) in Table 1 to Table 6 (the ratios A/B of the 6 marker groups) were separately extracted, and used for training by the random forest algorithm, so as to finally obtain 6 binary classification models.
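The per-bin screening can be sketched with the standard library alone. The text only says "t test", so this sketch assumes a two-sided, pooled-variance Student t test; the p-value is obtained by numerically integrating the t density with Simpson's rule, and all function names and toy ratios are hypothetical.

```python
import math
from statistics import mean, variance


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus twice the area of the density on [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    if math.isinf(upper):
        return 0.0
    h = upper / steps  # steps is even, as Simpson's rule requires
    area = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * area * h / 3)


def student_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    diff = mean(a) - mean(b)
    if se == 0.0:
        return (0.0 if diff == 0 else math.inf), df
    return diff / se, df


def screen_bins(group_a, group_b, alpha=0.05):
    """Indices of bins whose A/B ratios differ significantly between groups.

    group_a, group_b: one list of per-bin ratios for each sample.
    """
    kept = []
    for j in range(len(group_a[0])):
        t, df = student_t([s[j] for s in group_a], [s[j] for s in group_b])
        if two_sided_p(t, df) < alpha:
            kept.append(j)
    return kept


# Toy data: 5 samples per group, 2 bins; only bin 0 differs between groups.
normal = [[1.0, 1.0], [1.1, 1.1], [0.9, 0.9], [1.05, 1.05], [0.95, 0.95]]
tumor = [[1.5, 1.02], [1.6, 0.98], [1.4, 1.08], [1.55, 0.93], [1.45, 1.0]]
markers = screen_bins(normal, tumor)
```

With 50,000 bins tested at p<0.05, some bins are expected to pass by chance alone, which is consistent with the text following this step applying a further random forest screening to the retained bins.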
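The selection logic of this step (rank by importance, keep the top k, prefer the smaller set when accuracies are about equal) can be sketched without committing to a particular random forest library. The importance scores and accuracy values below are hypothetical placeholders, not results of the invention, and the 0.01 tolerance used to decide that two accuracies are "basically the same" is an assumption.

```python
def top_k_markers(importance_by_marker, k):
    """Names of the k markers with the highest feature importance."""
    ranked = sorted(importance_by_marker, key=importance_by_marker.get, reverse=True)
    return ranked[:k]


def pick_marker_count(accuracy_by_k, tolerance=0.01):
    """Smallest marker count whose accuracy is within tolerance of the best."""
    best = max(accuracy_by_k.values())
    return min(k for k, acc in accuracy_by_k.items() if acc >= best - tolerance)


# Hypothetical importances, as a trained random forest might report them.
importances = {"bin_17": 0.30, "bin_4": 0.05, "bin_230": 0.20, "bin_9": 0.10}
best_two = top_k_markers(importances, 2)

# Hypothetical test-set accuracies after retraining on each top-k marker set.
accuracy_by_k = {500: 0.936, 300: 0.938, 100: 0.940, 50: 0.937, 10: 0.850}
chosen_k = pick_marker_count(accuracy_by_k)
```

In this made-up example the top-50 set is chosen: it is the smallest set whose accuracy stays within the tolerance of the best value.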
The present inventors combined these 6 binary classification models to perform multi-category classification by voting, and the specific method was as follows:
the present inventors designed 4 decision-making units, and each decision-making unit contained 3 random forest binary classifiers:
I. ‘normal decision-making unit’: normal-vs-renal cancer, normal-vs-urothelial cancer, normal-vs-prostate cancer;
II. ‘renal cancer decision-making unit’: renal cancer-vs-normal, renal cancer-vs-urothelial cancer, renal cancer-vs-prostate cancer;
III. ‘urothelial cancer decision-making unit’: urothelial cancer-vs-normal, urothelial cancer-vs-renal cancer, urothelial cancer-vs-prostate cancer;
IV. ‘prostate cancer decision-making unit’: prostate cancer-vs-normal, prostate cancer-vs-renal cancer, prostate cancer-vs-urothelial cancer.
Then the present inventors performed voting for each decision-making unit. That is, the ratios A/B of the 6 groups of markers corresponding to a sample were separately input into the respective classifiers of the above 4 decision-making units for prediction. For example, the ‘normal decision-making unit’ received N1 votes for the normal group, the ‘renal cancer decision-making unit’ received N2 votes for the renal cancer group, the ‘prostate cancer decision-making unit’ received N3 votes for the prostate cancer group, and the ‘urothelial cancer decision-making unit’ received N4 votes for the urothelial cancer group. Finally, the category corresponding to the decision-making unit with the highest number of votes was taken as the finally predicted category; if several units had the same number of votes, the category with the highest prediction probability among those units was taken as the finally predicted category.
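The voting scheme above can be sketched as follows, taking the outputs of the 6 binary classifiers as given. The text does not specify how the "prediction probability" of a tied unit is computed; this sketch assumes the mean of the unit's three class probabilities. The class labels, function names and example probabilities are hypothetical.

```python
CLASSES = ["normal", "renal", "urothelial", "prostate"]

def class_probability(pair_probs, cls, other):
    """Probability of `cls` according to the cls-vs-other binary classifier."""
    if (cls, other) in pair_probs:
        return pair_probs[(cls, other)]
    return 1.0 - pair_probs[(other, cls)]

def vote(pair_probs, classes=CLASSES):
    """Return (predicted class, votes per decision-making unit)."""
    votes, mean_prob = {}, {}
    for cls in classes:
        # One decision-making unit: the three classifiers that involve `cls`.
        probs = [class_probability(pair_probs, cls, other)
                 for other in classes if other != cls]
        votes[cls] = sum(p > 0.5 for p in probs)
        mean_prob[cls] = sum(probs) / len(probs)   # tie-break score (assumption)
    winner = max(classes, key=lambda c: (votes[c], mean_prob[c]))
    return winner, votes

# Hypothetical classifier outputs: pair (a, b) -> probability that the sample is a.
example_probs = {
    ("normal", "renal"): 0.3, ("normal", "urothelial"): 0.2,
    ("normal", "prostate"): 0.4, ("renal", "urothelial"): 0.7,
    ("renal", "prostate"): 0.6, ("urothelial", "prostate"): 0.55,
}
predicted, unit_votes = vote(example_probs)
```

Each of the 6 classifiers is shared by two decision-making units, so a unit can receive at most 3 votes, and ties between units are possible.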
At the same time, the reliability of the 6 groups of markers was verified in the public TCGA database. TCGA contains copy number data for various tumor tissues (data of primary tumor tissues and normal tissues). The corresponding four sets of data were downloaded, the values corresponding to the 6 groups of markers were calculated (the segment values provided by TCGA were used to measure the change in copy number) and input into the random forest model for training and prediction, and the accuracy was evaluated.
As shown in
The analysis results below are for the final 6 groups of selected markers. The classification performance was evaluated with the random forest binary classifiers and calculated with a function in the R language.
Renal cancer vs. normal: sensitivity was 72.2%, specificity was 93.1%.
Urothelial carcinoma vs. normal: sensitivity was 76.2%, specificity was 100%. 3) As shown in
Prostate cancer vs. normal: sensitivity was 71.4%, specificity was 93.1%.
Renal cancer vs. prostate cancer: sensitivity was 72.2%, specificity was 85.7%.
Urothelial cancer vs. renal cancer: sensitivity was 95.2%, specificity was 77.8%.
Urothelial carcinoma vs. prostate cancer: sensitivity was 85.7%, specificity was 85.7%.
With reference to the experimental methods and samples in Examples 1 to 3, the integrated classification system (GUdetector) was used for the simultaneous classification of the 4 groups.
Diagnosis model of prostate cancer for male samples. The experimental methods and samples in Examples 1 to 3 were referred to, and the copy number data of 43 male patients in the non-tumor population and 45 prostate cancer patients were used to construct the classification model.
Prostate cancer vs. normal: accuracy rate AUC=0.967.
Considering the gender factor, the markers on all sex chromosomes were removed, the experimental methods and samples in Examples 1 to 3 were referred to, and the SVM model was used for the simultaneous classification of the 4 groups.
The prediction accuracy rate for each category was: 89.7% for the normal group, 76.2% for the urothelial cancer group, 64.3% for the prostate cancer group, 44.4% for the renal cancer group, and the overall accuracy rate was 72.0%.
With reference to the experimental methods and samples in Examples 1 to 3, the SVM model was used to perform the simultaneous classification of the 3 groups. The results showed that the prediction accuracy rate for each category was: 88.5% for the normal group, 76.1% for the urothelial cancer group, 64.8% for the renal cancer group, and the overall accuracy rate was 78.4%.
The experimental methods and samples in Examples 1 to 3 were referred to, only 90 non-tumor individuals and 65 patients with urothelial cancer were used, and the SVM model was used to perform the diagnosis of urothelial cancer and compared with the LASSO and random forest methods. For the SVM, the prediction accuracy rate was 94.7% for the normal group, 86.5% for the urothelial cancer group, and the overall accuracy rate was 91.4%. For the LASSO, the prediction accuracy rate was 94.7% for the normal group, 75.0% for urothelial cancer group, and the overall accuracy rate was 86.72%. For random forest method, the prediction accuracy rate was 97.4% for the normal group, 80.8% for the urothelial cancer group, and the overall accuracy rate was 89.8%.
With reference to the experimental methods and samples in Examples 1 to 3, dynamic monitoring of the therapeutic effect was performed, by way of example, in 3 urothelial cancer patients. Before and after the operation of each patient, the copy number of cfDNA and the proportion of tumor DNA in the total cfDNA were obtained by the ichorCNA algorithm. In all three patients, copy number changes and tumor DNA content were detected before the operation but not after the operation. This was consistent with the other tests of the patients, and none of the three patients had a recurrence. The above results support that the present invention could also be used for non-invasive prognosis monitoring.
It was also noted that: Specificity and sensitivity are indicators to evaluate the efficiency of marker classification. Sensitivity refers to the ability to pick out cancer patients, and specificity refers to the ability to pick out normal people. For example, if there are 1,000 tumor patients and 1,000 normal persons, the present inventors could pick out 722 patients from the tumor group and 931 persons from the normal group by the classifier with sensitivity of 72.2% and specificity of 93.1%.
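The worked example above corresponds to the usual definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). A minimal sketch using the counts from the example (the function and variable names are illustrative):

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of actual patients that the classifier picks out."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Fraction of actual normal persons that the classifier picks out."""
    return true_negatives / (true_negatives + false_positives)

# 1,000 tumor patients and 1,000 normal persons, as in the example above.
example_sensitivity = sensitivity(722, 1000 - 722)
example_specificity = specificity(931, 1000 - 931)
```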
The sensitivity and specificity between two cancers refer to the ability to separate the two tumors. Although these two concepts are normally used to evaluate negative versus positive, or normal versus abnormal, the present inventors also used them herein to compare two kinds of tumors. In that case, the present inventors defined one tumor as the positive class, which is displayed as the ‘positive’ class at the bottom of each result.
In addition to the sensitivity value and specificity value, accuracy refers to the overall accuracy rate. The confusion matrix at the top of each result indicates the number correctly classified into a group and the number misclassified into another group.
In the confusion matrix, Reference refers to the original category and Prediction refers to the predicted category. For example, in the UC group, 16 UCs were predicted to be UC (predicted correctly), 2 UCs were predicted to be Normal, 3 UCs were predicted to be PRAD, and none were predicted to be KIRC, and so forth;
the overall accuracy rate was 0.7195;
the prediction accuracy rate of each category was the corresponding Sensitivity below, and the specificity was not considered herein, because these two concepts were concepts of the classification for two categories, and the present classification was for 4 categories in which only the overall accuracy rate and the sensitivity of each category should be taken into account.
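The way the per-category sensitivity and the overall accuracy rate are read off a multi-category confusion matrix can be sketched as follows. Only the UC row quoted above is used as data; the counts for the other reference categories are not given in the text and are not reconstructed here. The function names and the nested-dictionary layout are assumptions made for illustration.

```python
def per_class_sensitivity(confusion):
    """confusion[reference][prediction] -> count; returns reference -> sensitivity."""
    return {ref: row.get(ref, 0) / sum(row.values())
            for ref, row in confusion.items()}

def overall_accuracy(confusion):
    """Correctly classified samples (the diagonal) divided by all samples."""
    correct = sum(row.get(ref, 0) for ref, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total

# UC reference row from the description above: 16 UC, 2 Normal, 3 PRAD, 0 KIRC.
uc_row = {"UC": {"UC": 16, "Normal": 2, "PRAD": 3, "KIRC": 0}}
uc_sensitivity = per_class_sensitivity(uc_row)["UC"]   # 16 / 21
```

Here 16 / 21 is about 0.762, which agrees with the 76.2% prediction accuracy rate reported for the urothelial cancer group.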
The present inventors first established a urine-based cfDNA copy number classification system, which could predict the different tissue sources of unknown urogenital system tumors at one time through the screened biomarker groups, and had high sensitivity and specificity. In addition, considering gender differences, only men had the need to assess the risk of prostate cancer. Therefore, the present inventors also retrained prostate cancer classification markers for men. In addition, excluding gender factors, a three-category classification model (normal, renal cancer and urothelial cancer) was trained. Since the ensemble classification voting method could not be used for the classification of 3 categories, the present inventors compared machine learning classification methods such as SVM, LASSO and random forest, and found that the SVM model was significantly better than the other two machine learning models (LASSO and random forest).
For a random unknown subject in the outpatient clinic (who could be a healthy person, or a patient with urogenital system tumor), the following method was referred to:
1. collecting morning urine, and extracting cfDNA;
2. screening DNA fragments of 100 bp to 300 bp with magnetic beads;
3. constructing a whole-genome library;
4. performing the whole-genome sequencing on the library to obtain sequencing data;
5. dividing the genome of the sample into 50,000 bins; normalizing the sequencing data, and using the varbin algorithm to calculate the reads ratios corresponding to the 50,000 bins;
6. extracting the ratios corresponding to the 300 markers shown in Table 1 to Table 6, and inputting them into the above integrated classification system (GUdetector) for prediction.
For the specific operations of steps 1 to 4 above, reference was made to Examples 1 to 4, respectively.
Prostate cancer is a male-specific tumor. Therefore, if gender factors were not taken into account, since healthy people comprised both males and females, copy number differences on the sex chromosomes would cause the diagnostic accuracy of the classifier to be overestimated. Therefore, when the present inventors diagnosed whether an unknown male subject had prostate cancer, the men of the healthy population were used for re-screening of markers (healthy men vs. prostate cancer patients, Table 7). For a male subject in the outpatient clinic, the following method was referred to:
1. collecting a morning urine and extracting cfDNA;
2. screening DNA fragments of 100 bp to 300 bp with magnetic beads;
3. constructing a whole-genome library;
4. performing the whole-genome sequencing on the library to obtain sequencing data;
5. dividing the genome of the sample into 50,000 bins; normalizing the sequencing data, and using the varbin algorithm to calculate the reads ratios corresponding to the 50,000 bins;
6. extracting the ratios corresponding to the 50 markers shown in Table 7, and using a machine learning algorithm such as SVM to predict whether the unknown sample was a prostate cancer patient.
For the specific operations of steps 1 to 4 above, reference was made to Examples 1 to 4, respectively.
For a random unknown subject in the outpatient clinic (who could be a healthy person, or a patient with renal cancer or urothelial cancer), the following method was referred to:
1. collecting a morning urine and extracting cfDNA;
2. screening DNA fragments of 100 bp to 300 bp with magnetic beads;
3. constructing a whole-genome library;
4. performing the whole-genome sequencing on the library to obtain sequencing data;
5. dividing the genome of the sample into 50,000 bins; normalizing the sequencing data, and using the varbin algorithm to calculate the reads ratios corresponding to the 50,000 bins;
6. extracting the ratios corresponding to the 150 markers shown in Tables 1, 2 and 5, and using a machine learning algorithm such as SVM to predict whether the unknown sample was from a normal person, a renal cancer patient, or a urothelial cancer patient.
For the specific operations of steps 1 to 4 above, reference was made to Examples 1 to 4, respectively.
The copy number analysis of cfDNA could be obtained by other algorithms, such as the ichorCNA algorithm. In this method, the genomic region was divided into uniform regions with a length of 1,000,000 bp, and then the copy number variation and the proportion of tumor-derived DNA were calculated. For a patient who was checked before surgery and rechecked after treatment in the outpatient clinic, the following method was referred to:
1. collecting a morning urine before surgery and a morning urine during regular review, and extracting cfDNA;
2. screening DNA fragments of 100 bp to 300 bp with magnetic beads;
3. constructing a whole-genome library;
4. performing the whole-genome sequencing on the library to obtain sequencing data;
5. using the ichorCNA method to obtain the copy number variation atlases of cfDNA in the urine of the cancer patient before surgery and in the urine during regular review, and estimating tumor DNA contents;
6. evaluating the treatment efficacy and recurrence of the patient according to the comparison of the above atlases and tumor DNA contents.
The method in the reference, Circulating tumour DNA methylation markers for diagnosis and prognosis of hepatocellular carcinoma, was used.
The input data were the ratios A/B corresponding to the 6 groups of biomarkers (markers) in Table 1 to Table 6.
The results were shown in Table 9 below.
The results showed that when the LASSO classification model was used, the accuracy rates of various predictions were lower than those of the integrated classification system (GUdetector) proposed by the present inventors, and the overall accuracy was only 58.5%.
The method in the reference, CancerLocator: non-invasive cancer diagnosis and tissue-of-origin prediction using methylation profiles of cell-free DNA, was used.
The input data were the ratios A/B corresponding to the 6 groups of biomarkers (markers) in Table 1 to Table 6.
The results were shown in Table 10 below.
The results showed that when the SVM classification model was used, the accuracy rates of various predictions were lower than those of the integrated classification system (GUdetector) proposed by the present inventors, and the overall accuracy was only 54.7%.
The method in the reference, Epigenetic profiling for the molecular classification of metastatic brain tumors, was used.
The input data were the ratios A/B corresponding to the 6 groups of biomarkers (markers) in Table 1 to Table 6.
The results were shown in Table 11 below.
The results showed that when the random forest classification model for four categories was used, the accuracy rates of various predictions were lower than those of the integrated classification system (GUdetector) proposed by the present inventors, and the overall accuracy was only 65.1%.
Although the specific embodiments of the present invention have been described in detail, those skilled in the art will understand that according to all the teachings that have been disclosed, various modifications and substitutions can be made to those details, and these changes are all within the protection scope of the present invention. The full scope of the invention is given by the appended claims and any equivalents thereof.
Number | Date | Country | Kind |
---|---|---|---|
201910374094.1 | May 2019 | CN | national |
This patent application is the U.S. National Stage of International Patent Application No. PCT/CN2020/087830, filed Apr. 29, 2020, which claims priority to Chinese Patent Application No. 201910374094.1, filed May 7, 2019, each of which is hereby incorporated by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2020/087830 | 4/29/2020 | WO |