The disclosure relates to an identification system of circulating biomarkers, a development method of circulating biomarkers, a cancer detection method and a kit, and more specifically, to an identification system of circulating biomarkers for cancer detection, a development method of circulating biomarkers for cancer detection, a cancer detection method and a kit.
There is a great variety of circulating biomarkers, including DNA, mRNA, microRNA, metabolites and proteins; even circulating tumor cells and extracellular vesicles (EVs) may be used as circulating biomarkers. These types of circulating biomarkers can be obtained from minimally invasive sample types, including blood, saliva, urine, etc. The sample collection process is simple, which makes them suitable for disease screening or tracking.
Exosomes are lipid bilayer vesicles with a size of about 40 nanometers to 100 nanometers that are secreted by cells and contain active molecules, and almost all types of cells (including cancer cells) secrete these vesicles. Exosomes and microvesicles are both extracellular vesicles. Microvesicles are formed by outward budding from the cell membrane, while exosomes are formed by inward budding of endosomes and released via fusion of late endosomes with the cell membrane. By secreting exosomes containing signaling molecules, cells can transmit information to adjacent cells or to distant cells or tissues through the circulatory system. When exosomes are absorbed by receiving cells, the carried signaling molecules such as proteins, mRNAs or microRNAs will change the gene or protein expression of the receiving cell. A growing body of studies has shown that tumors can use exosomes to transmit signaling molecules to regulate the tumor microenvironment, and can also deliver exosomes to distant organs through the circulatory system, so that an environment suitable for colonization and growth of cancer cells, called the pre-metastatic niche, is formed to promote tumor metastasis. Therefore, analyzing the composition of exosomes might provide a convenient and accurate detection tool for the diagnosis and prognosis of breast cancer.
Circulating biomarkers can be used for the diagnosis and monitoring of disease, but suitable circulating biomarkers are not easy to find. There are roughly two approaches to searching for circulating biomarkers: one is to search for candidate biomarkers in the tissue and then verify their use in the circulatory system (such as blood); the other is to screen biomarkers directly from the blood. The former approach can focus on markers directly related to the disease, but the markers may not be good circulating biomarkers due to interference from molecules of other tissues or cells in the blood. In contrast, when biomarkers are screened directly from the blood, it is difficult to know whether they are directly related to the disease, because the composition of blood is complex and the concentration of biomarkers from specific tissues may be low and masked by other constituent molecules. Moreover, even if only the protein and nucleic acid biomarkers that have been observed on exosomes in the literature are considered, the total number of corresponding genes exceeds 10,000.
Therefore, how to establish statistical analysis and machine learning algorithms to predict potential exosome biomarkers for subsequent cancer detection and recurrence model development from the large amount of various existing omics databases becomes increasingly important.
An embodiment of the present disclosure provides an identification system of circulating biomarkers for cancer detection, a development method of circulating biomarkers for cancer detection, a cancer detection method and a kit. The identification system and the development method can predict potential exosome biomarkers for subsequent cancer detection and recurrence model development.
The identification system of circulating biomarkers for cancer detection of the embodiment in the disclosure includes a) an identification module, b) a computing module and c) an evaluation module. The a) identification module is used to identify expression levels of multiple genes in normal tissue samples and tumor tissue samples, and select genes with high expression levels in the tumor tissue samples. The b) computing module uses tissue-specific genes and group-enriched genes to calculate a weight of each human tissue’s contribution to plasma exosomes. The c) evaluation module computes the overlap of gene expression levels of plasma exosomes of healthy people and cancer patients by using an overlapping index, and selects circulating biomarkers and combinations thereof suitable for cancer detection based on the plasma exosomes.
The development method of circulating biomarkers for cancer detection of the embodiment in the disclosure includes the following steps. Expression levels of multiple genes in normal tissue samples and tumor tissue samples are identified, and genes with high expression levels in the tumor tissue samples are selected. Afterwards, a weight of each human tissue’s contribution to plasma exosomes is calculated using tissue-specific genes and group-enriched genes. Next, expression levels of plasma exosome genes of healthy people and cancer patients are compared by using an overlapping index, and circulating biomarkers and combinations thereof suitable for detection and evaluation of the plasma exosomes are then selected.
The identification system of circulating biomarkers for cancer detection of the embodiment in the disclosure uses the aforementioned development method of circulating biomarkers for cancer detection.
The cancer detection method of the embodiment in the disclosure uses the circulating biomarkers developed by the aforementioned identification system of circulating biomarkers for cancer detection, and the circulating biomarkers include BIRC5 and ART3.
The kit of the embodiment in the disclosure uses the circulating biomarkers developed by the aforementioned identification system of circulating biomarkers for cancer detection, and the circulating biomarkers include BIRC5 and ART3.
Based on the above, the disclosure provides a development method of circulating biomarkers in which the large amount of data in protein or nucleic acid databases of diseased tissues is analyzed by using methods such as the null hypothesis test and the overlapping index simulation. The circulating biomarkers which can be used for disease diagnosis or monitoring are selected by simulating the changes of specific biomarkers in the circulatory system after the occurrence of the disease.
In order to make the above-mentioned features of the present disclosure more comprehensible, the following embodiments are given and described in detail with the accompanying drawings as follows.
The following examples are described in detail in conjunction with the accompanying drawings, but the provided examples are not intended to limit the scope of the present disclosure. Moreover, terms such as “include”, “comprise”, “have”, etc. used in the text are all open-ended terms, that is, “including but not limited to”.
The disclosure provides an identification system of circulating biomarkers for cancer detection and a development method of circulating biomarkers for cancer detection. The identification system of the disclosure uses the development method of the disclosure. Therefore, for the sake of brevity, the following description mainly addresses the identification system of circulating biomarkers for cancer detection. The details of the development method largely duplicate those of the identification system, so they are not described separately below.
The identification system of circulating biomarkers for cancer detection of the embodiment in the disclosure includes a) an identification module, b) a computing module and c) an evaluation module, wherein the identification module is used to select tumor tissue-upregulated gene markers, the computing module is used to calculate tissue weights, and the evaluation module is used to evaluate differences between healthy people and patients. In the following, the a) identification module, the b) computing module, and the c) evaluation module will be used to describe the identification system of the circulating biomarker for cancer detection according to an embodiment of the disclosure.
By way of definition, in the identification system of circulating biomarkers for cancer detection in the disclosure, the “identification system” includes hardware operating platforms (personal computers, supercomputers, etc.) and software (application programming interfaces, data processing algorithms, etc.), and a “module” can be a block, area, part, application area, or operation area in the identification system, but the disclosure is not limited thereto.
In the identification system of circulating biomarkers for cancer detection disclosed in the disclosure, the a) identification module compares normal tissue samples and tumor tissue samples with respect to the multiple genes covered by exon-level RNA-seq, or their products such as protein and mRNA expression levels, so as to select genes with high expression levels in the tumor tissue samples. Although the embodiment is mainly described with transcriptomics as an example, the disclosure is not limited thereto, and can also be applied to other omics data such as proteomics. It must be noted that quality control (QC) of the omics data is performed before the genes with high expression levels in the tumor tissue samples are selected.
In the present embodiment, the genes with high expression levels in tumor tissue samples are selected using statistical analysis methods, which include a null hypothesis test and a fold change threshold. In the following, the null hypothesis test and the fold change threshold will be explained in detail.
In the present embodiment, the null hypothesis test is used to examine whether the average expression level of each gene in tumor tissue samples is significantly higher than that in normal samples. The null hypothesis test includes Welch’s t-test, permutation test and false discovery rate (FDR). In more detail, the Welch’s t-test is used to calculate a p value, the permutation test is used to adjust the p value, and then the false discovery rate is used as a standard for screening to reduce the probability of selecting false high-expression genes. In the following, the Welch’s t-test, the permutation test and the false discovery rate will be explained in detail.
In this embodiment, Welch’s t-test allows the variances of the tumor and normal tissue data to differ when testing whether the average expression level of each gene in tumor samples is significantly higher than that in normal samples. In more detail, the applied formula is as follows:
$$ t = \frac{\bar{x}_T - \bar{x}_N}{\sqrt{\dfrac{s_T^2}{N_T} + \dfrac{s_N^2}{N_N}}} $$
The statistic tests whether the average expression level (mean) of the gene in the tumor samples, $\bar{x}_T$, is statistically significantly higher than the average expression level of the gene in the normal samples, $\bar{x}_N$, wherein $N_T$ and $N_N$ are the numbers of tumor and normal samples, and $s_T$ and $s_N$ are respectively the sample standard deviations of cancer and normal samples. The null hypothesis (H0) here is: the average gene expression level of tumor samples ≤ the average gene expression level of normal samples, which is a one-tailed test. The probability threshold is set to 0.5%; that is, if the probability of observing the current data under the assumption of H0 is less than 0.5% (p value<0.005), then H0 is rejected.
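The Welch statistic described above can be sketched in Python as follows. This is a minimal illustration and not the implementation of the embodiment: the function name `welch_t` and the sample values are hypothetical, and the degrees of freedom use the standard Welch-Satterthwaite approximation, which the text does not spell out. The one-tailed p value is the upper tail of Student's t distribution at the returned statistic and degrees of freedom (for example `scipy.stats.t.sf(t, df)`, if SciPy is available).

```python
import math
from statistics import mean, variance

def welch_t(tumor, normal):
    """Return (t, df) for Welch's unequal-variance t-test (tumor vs normal)."""
    n_t, n_n = len(tumor), len(normal)
    # Squared standard error contributed by each group (sample variance / n)
    se_t = variance(tumor) / n_t
    se_n = variance(normal) / n_n
    t = (mean(tumor) - mean(normal)) / math.sqrt(se_t + se_n)
    # Welch-Satterthwaite approximation of the degrees of freedom (assumption)
    df = (se_t + se_n) ** 2 / (se_t ** 2 / (n_t - 1) + se_n ** 2 / (n_n - 1))
    return t, df

# Hypothetical expression levels of one gene
tumor = [8.1, 9.4, 7.9, 10.2, 9.0]
normal = [4.2, 5.1, 4.8, 5.5, 4.9]
t_stat, dof = welch_t(tumor, normal)
print(f"t = {t_stat:.2f}, df = {dof:.2f}")
```

A large positive `t` supports rejecting H0 in the one-tailed direction used here (tumor mean higher than normal mean).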
When samples are limited, the resampling-based permutation test can be an effective statistical test. In more detail, the applied formula is as follows:
$$ p_{pm} = \frac{N_{p \le p_0}}{N_{pm}} $$
wherein $N_{pm}$ is the number of random permutations in the permutation test, $N_{p \le p_0}$ is the cumulative number of the $N_{pm}$ random permutations in which the p value is less than or equal to the p value before permutation ($p_0$), and $p_{pm}$ is the p value calculated by the permutation test. In this embodiment, $N_{pm} = 10^5$, and $p_{pm}$ can be regarded as a correction to the p value of Welch’s t-test.
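The permutation step can be sketched as below. This is a hedged illustration under stated assumptions: the names `welch_t` and `permutation_p` are hypothetical, the sample values are invented, and a much smaller number of permutations than in the embodiment is used so that the example runs quickly. As a simplification, the sketch counts permutations whose Welch t statistic is at least the observed one, which stands in for counting permutations whose p value is at most the p value before permutation.

```python
import math
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch t statistic for group a versus group b."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def permutation_p(tumor, normal, n_pm=2000, seed=0):
    """One-tailed permutation p value: fraction of label shuffles at least as extreme."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    observed = welch_t(tumor, normal)
    pooled = list(tumor) + list(normal)
    hits = 0
    for _ in range(n_pm):
        rng.shuffle(pooled)            # randomly reassign tumor/normal labels
        perm_t = welch_t(pooled[:len(tumor)], pooled[len(tumor):])
        if perm_t >= observed:         # stand-in for "p value <= p value before permutation"
            hits += 1
    return hits / n_pm

# Hypothetical expression levels of one gene
tumor = [8.1, 9.4, 7.9, 10.2, 9.0]
normal = [4.2, 5.1, 4.8, 5.5, 4.9]
p_pm = permutation_p(tumor, normal)
print(f"permutation p value = {p_pm:.4f}")
```

With only a few samples per group the smallest attainable permutation p value is limited by the number of distinct label assignments, which is one reason a large number of permutations alone cannot make p arbitrarily small.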
In this embodiment, when screening from a large number of genes, in order to reduce the incidence of false positives, the false discovery rate q≤0.005 is used as the standard. In more detail, the applied formula is as follows:
$$ q = \frac{p_n \times N_{gene}}{n} $$
When a large number of genes are screened at the same time, in order to reduce the probability of false positives, the p values obtained for the genes from the null hypothesis test can be sorted from smallest to largest, and the false discovery rate standard can then be used to screen genes, wherein $N_{gene}$ is the total number of screened genes, and $p_n$ is the p value of the nth gene (the genes having been sorted from smallest to largest according to the p value obtained by the null hypothesis test). After the maximum n value ($n_{max}$) that satisfies q≤0.005 is determined, the first $n_{max}$ genes are the genes selected based on the false discovery rate.
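The false discovery rate screening can be sketched as follows. This is an illustration only: it assumes the Benjamini-Hochberg form q = p_n × N_gene / n, which is consistent with the definitions of N_gene, p_n and n_max given above, and the function name `select_by_fdr` and the p values are hypothetical.

```python
def select_by_fdr(p_values, q_max=0.005):
    """Return the indices of the first n_max genes, sorted by ascending p value."""
    n_gene = len(p_values)
    # Gene indices ordered from smallest to largest p value
    order = sorted(range(n_gene), key=lambda i: p_values[i])
    n_max = 0
    for rank, idx in enumerate(order, start=1):
        q = p_values[idx] * n_gene / rank   # assumed form of the q value
        if q <= q_max:
            n_max = rank                    # keep the largest rank that satisfies q <= q_max
    return order[:n_max]

# Hypothetical p values for six genes
p_values = [0.0001, 0.2, 0.0004, 0.03, 0.0009, 0.6]
selected = select_by_fdr(p_values)
print(selected)
```

Because n_max is the largest rank satisfying the condition, every gene ranked before it is selected even if its own q value briefly exceeded the threshold.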
Considering the interpretability of detection instrument results, this disclosure also sets appropriate fold change threshold conditions to exclude genes whose fold change is too small. The definition of fold change (FC) is:
$$ FC = \frac{\bar{x}_T}{\bar{x}_N} $$
That is, FC is the ratio of the average gene expression level of tumor samples ($\bar{x}_T$) to that of normal samples ($\bar{x}_N$). Requiring only that the average gene expression level of tumor tissue be higher than that of normal tissue already excludes a large number of genes (the exact number excluded depends on the range of genes covered by each data set; for the data sets used in this disclosure, 40% to 50% of genes are excluded), and the number of genes remaining after screening with the condition FC>2 is less than 5% of the original number of genes.
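The fold change filter can be illustrated with a short sketch. The gene names, expression values and helper names below are hypothetical, and expression levels are assumed to be on a linear (not log-transformed) scale so that the ratio of means is meaningful.

```python
from statistics import mean

def fold_change(tumor, normal):
    """Ratio of the mean tumor expression level to the mean normal expression level."""
    return mean(tumor) / mean(normal)

def filter_by_fc(expression, threshold=2.0):
    """Keep genes whose fold change exceeds the threshold."""
    return [gene for gene, (tumor, normal) in expression.items()
            if fold_change(tumor, normal) > threshold]

# Hypothetical per-gene (tumor, normal) expression levels
expression = {
    "GENE_A": ([12.0, 18.0, 15.0], [5.0, 6.0, 4.0]),   # FC = 3.0
    "GENE_B": ([6.0, 7.0, 8.0], [5.0, 6.0, 7.0]),      # FC below 2
}
kept = filter_by_fc(expression)
print(kept)
```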
According to an embodiment of the present disclosure, exemplary operations are as follows. The triple-negative breast cancer RNA-seq gene expression level dataset GSE118527 in the Gene Expression Omnibus (GEO) database was analyzed, and the data of 88 cases with tumors and normal tissues around the tumors were compared, covering a total of 45,308 genes. The filter conditions are, for example:
Before the b) computing module calculates the weight of each human tissue’s contribution to plasma exosomes, the identification system of circulating biomarkers for cancer detection in this disclosure consults the subcellular location information of exosome databases to determine whether the circulating biomarkers are expressed on the surface and/or inside of exosomes. Circulating biomarkers expressed on the exosome surface can be further used for antibody binding.
In the identification system of circulating biomarker for cancer detection disclosed in the disclosure, b) computing module uses tissue-specific genes and group-enriched genes to calculate the weight of each human tissue for plasma exosome contribution. In more detail, the applied formula is as follows:
Plasma exosomes are the sum of the exosomes secreted by various tissues/organs/blood cells in the blood. Therefore, the gene expression level of a gene (gn) on plasma exosomes can be expressed by the above formula. According to an embodiment of the disclosure, a total of 69 types of tissues, organs, or blood cells, etc. that provide detection data in large human omics databases such as HPA, FANTOM5, and GTEx are expected to cover all sources of exosomes in the blood (as shown in Table 1 below). In order to calculate the (Cgn × Wgn) weight of each tissue, according to an embodiment of the disclosure, human tissue and organ gene expression level data provided by online databases such as HPA, FANTOM5, and GTEx is used. First, several tissue-specific genes of each tissue are selected, the plasma exosome expression level of this type of gene is estimated as a contribution only from the highly expressed tissue, and then the (Cgn × Wgn) weight of the tissue is calculated. If there are no tissue-specific genes in a tissue, several group-enriched genes are selected, that is, a group of genes with significantly increased expression level in this tissue and other tissues, so as to jointly determine the (Cgn × Wgn) weight of each tissue.
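The idea that plasma exosomes are the sum of contributions from individual tissues can be sketched as below. This is an assumption-laden illustration rather than the formula of the disclosure: it treats the (Cgn × Wgn) term as a single combined weight per tissue and models the plasma exosome level of a gene as a weighted sum of tissue expression levels. All names and numbers are hypothetical.

```python
def plasma_exosome_level(tissue_expression, tissue_weight):
    """Weighted sum of a gene's tissue expression levels (assumed model)."""
    return sum(tissue_weight[tissue] * level
               for tissue, level in tissue_expression.items())

def weight_from_specific_gene(plasma_level, tissue_level):
    """For a tissue-specific gene, attribute the whole plasma level to one tissue."""
    return plasma_level / tissue_level

# Hypothetical combined weights and expression levels of one gene
tissue_weight = {"liver": 0.5, "breast": 0.25}
tissue_expression = {"liver": 4.0, "breast": 8.0}
level = plasma_exosome_level(tissue_expression, tissue_weight)
print(level)

# A tissue-specific gene measured at 3.0 in plasma exosomes and 12.0 in its tissue
print(weight_from_specific_gene(3.0, 12.0))
```

Under this reading, a tissue-specific gene gives a direct estimate of one tissue's weight, whereas group-enriched genes constrain several tissue weights jointly.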
According to an embodiment of the disclosure, when calculating the weight, for each tissue-specific gene and group-enriched gene, HPA, FANTOM5 and GTEx human tissue expression level data are used to calculate the expression level probability density function of each tissue with the lognormal distribution for best fitting the expression level data of each tissue. After the expression level distribution of a tissue is obtained, it is used to calculate the gene expression level distribution of the exosome released into the blood by the tissue under different (Cgn × Wgn) weights.
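Fitting a lognormal distribution to tissue expression data can be sketched as follows. This is a minimal sketch under assumptions: the parameters are estimated by maximum likelihood from the logarithms of the data, and the weight is assumed to scale the expression level multiplicatively, so that scaling a lognormal variable by w shifts its log-mean by ln(w). The function names and data are hypothetical.

```python
import math
from statistics import mean, pstdev

def fit_lognormal(values):
    """Maximum-likelihood (mu, sigma) of a lognormal fitted to positive values."""
    logs = [math.log(v) for v in values]
    return mean(logs), pstdev(logs)

def lognormal_pdf(x, mu, sigma):
    """Probability density of the lognormal distribution at x."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))

def scale_lognormal(mu, sigma, weight):
    """Parameters of weight * X when X is lognormal(mu, sigma) (assumed scaling)."""
    return mu + math.log(weight), sigma

# Hypothetical expression levels of one gene in one tissue
levels = [3.2, 5.1, 4.4, 7.9, 6.0, 2.8]
mu, sigma = fit_lognormal(levels)
mu_w, sigma_w = scale_lognormal(mu, sigma, 0.01)
print(mu, sigma, mu_w)
```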
According to an embodiment of the disclosure, the exosome gene expression level data of 149 healthy people are assembled, and for the selected tissue-specific genes and group-enriched genes, an in-house algorithm is used to adjust and test the weight of several tissues at the same time, and find out the (Cgn × Wgn) weight that can best restore the plasma exosome expression level distribution of all tissue-specific and group-enriched genes. The order of magnitude of tissue weight obtained by the simulation is as follows:
In the identification system of circulating biomarker for cancer detection disclosed in the present disclosure, the c) evaluation module compares the gene expression levels of the plasma exosomes of healthy people and cancer patients by using an overlapping index, and selects circulating biomarkers and combinations thereof suitable for detection and evaluation of the plasma exosomes.
According to an embodiment of the present disclosure, after the weight of each human tissue’s contribution to plasma exosomes is calculated in the b) computing module, the calculated weight is used to simulate the plasma exosome expression level distributions of a circulating biomarker in healthy people and in cancer patients. The intersection area of the probability density functions of the plasma exosome expression levels of the healthy people and the cancer patients is calculated from the simulated distributions, and this intersection area is the overlapping index. The smaller the intersection area (overlapping index), the better plasma exosome detection is expected to distinguish healthy and cancer statuses. A biomarker with an overlapping index ≤ 0.70 is listed as a potential selection target. For example, the aforementioned overlapping index may be 0.70, 0.65, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15 or 0.10, etc., but the present disclosure is not limited thereto. Furthermore, in addition to the overlapping index calculation, the biomarker selection also comprehensively considers the known characteristics of the gene, such as the gene function known in the literature, the subcellular locations of the gene product (protein) in the cell, and the plasma membrane confidence, etc., and finally the circulating biomarkers and combinations thereof suitable for detection and evaluation of the plasma exosomes are selected.
According to an embodiment of the present disclosure, after the weight of the contribution of each tissue/organ/blood cell to the plasma exosome expression level is calculated, it is used to simulate the plasma exosome expression level distribution in healthy people for the genes identified in the previous embodiment as highly expressed in tumor tissue. Then, based on the gene expression level data of 88 cases of triple-negative breast cancer tissues from the GSE118527 data set, and considering the phenomenon that cancer tissue cells release more exosomes than normal cells, the plasma exosome expression level distribution of individual genes in triple-negative breast cancer patients is simulated. From the simulated plasma exosome expression level distributions of healthy and diseased people, the intersection area of the two distributions can be calculated for each individual gene, and the overlapping index can be obtained.
The smaller the overlapping index of a gene, the smaller the overlap of plasma exosome expression levels between breast cancer and breast cancer-free states of the gene, and thus potentially a better biomarker for distinguishing breast cancer and breast cancer-free states through exosome detection in the person to be examined.
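The overlapping index can be sketched numerically as the area under the pointwise minimum of two probability density functions. This is an illustration under assumptions: both plasma exosome distributions are taken to be lognormal and are described by the hypothetical (mu, sigma) of their logarithms, and the integral is evaluated in log space, which leaves the intersection area unchanged because both densities undergo the same change of variable. The function names and parameters are not from the disclosure.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of log-expression, which is normal when expression is lognormal."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def overlapping_index(healthy, cancer, n_steps=20000):
    """Area under min(f_healthy, f_cancer), integrated with the midpoint rule."""
    (mu_h, sd_h), (mu_c, sd_c) = healthy, cancer
    # Integration range wide enough to cover both distributions
    lo = min(mu_h - 8 * sd_h, mu_c - 8 * sd_c)
    hi = max(mu_h + 8 * sd_h, mu_c + 8 * sd_c)
    step = (hi - lo) / n_steps
    total = 0.0
    for i in range(n_steps):
        y = lo + (i + 0.5) * step
        total += min(normal_pdf(y, mu_h, sd_h), normal_pdf(y, mu_c, sd_c))
    return total * step

# Hypothetical log-scale parameters for healthy people and cancer patients
healthy = (0.0, 1.0)
cancer = (2.0, 1.0)
ovl = overlapping_index(healthy, cancer)
print(f"overlapping index = {ovl:.3f}, candidate = {ovl <= 0.70}")
```

An index near 1 means the two distributions are nearly indistinguishable, while an index near 0 means the gene separates the two states almost completely.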
According to the results of the null hypothesis test and the overlapping index analysis, the subcellular locations of these gene products (proteins) in the cell are further compared. Proteins annotated as membrane-expressed in the HPA (Human Protein Atlas) database are selected, or genes with plasma membrane confidence > 3 and extracellular confidence > 3 in the COMPARTMENTS subcellular localization database are selected. Among genes having a fold change (FC) greater than 1.5, the genes are sorted from low to high overlapping index, and ART3, BIRC5, CD274 and PTK7 are taken as examples for verification as exosome protein biomarkers. According to the annotations of HPA and COMPARTMENTS, ART3, BIRC5, CD274 and PTK7 may all be exosome surface proteins. In the triple-negative breast cancer tissue RNA-seq study (GSE118527), through the analysis of the a) identification module in this disclosure, the fold change and q values are as follows: ART3: FC=2.7, q=2.3×10⁻⁹; BIRC5: FC=8.3, q=6.64×10⁻⁴⁵; CD274: FC=1.6, q=2.2×10⁻⁹; and PTK7: FC=1.9, q=2.4×10⁻¹². In addition, according to the circulatory system simulation results of the b) computing module in this disclosure, the amount and distribution of ART3 and BIRC5 in normal human plasma exosomes are significantly lower than those of CD274 and PTK7. In the verification with cell line exosomes, ultracentrifugation is first carried out to separate the cell line exosomes. After quantification by Nanoparticle Tracking Analysis (NTA), an equal number of exosomes are taken to compare, by immunoassay, the expression of these proteins in exosomes from a normal breast epithelial cell line (HMEC), triple-negative breast cancer cell lines (MDA-MB-231, MDA-MB-468 and HCC1806) and normal human plasma.
According to an embodiment of the present disclosure, the immunoassay includes the following steps. Firstly, 96-well round-bottom white plates carrying magnetic beads conjugated with ART3, BIRC5, CD274 and PTK7 antibodies are prepared. 100 µL of exosome samples isolated from different cell lines and plasma, at a concentration of 5×10⁸ particles/mL, are added to the respective wells. The reaction is performed on a shaker at 900 rpm at 37° C. for 60 minutes under non-lysing conditions. After the magnetic beads are washed with 0.1% Tween-PBST, 100 µL of 0.5 µg/mL biotin-conjugated anti-CD81 antibody is added to each well and reacted for 60 minutes. After the magnetic beads are further washed with 0.1% Tween-PBST, 100 µL of streptavidin-HRP enzyme is added to each well and reacted for another 60 minutes. After the magnetic beads are washed again with 0.1% Tween-PBST, the luminescent HRP substrate is added and reacted on the shaker for one minute, and the luminescence signal is read.
In this disclosure, the detection sensitivity for breast cancer cell exosomes is evaluated by analyzing plasma exosome samples spiked with various concentrations of breast cancer cell exosomes. According to an embodiment of the present disclosure, exosomes from HCC1806, a triple-negative breast cancer cell line, are added to 100 µL of plasma exosomes processed by size exclusion chromatography (SEC) so that the final concentrations of HCC1806 exosomes are 1×10⁹, 2×10⁸, 4×10⁷, and 8×10⁶ particles/mL, respectively. Then the exosome surface proteins are detected according to the above-mentioned magnetic bead immunoassay.
Exosomes are vesicles secreted by cells, which can carry molecules such as proteins, mRNA or microRNA of their cells of origin. Exosomes of specific subgroups, such as tumor exosomes, can be enriched by affinity purification that targets identified surface proteins, and the biomarkers carried by the enriched exosomes are then further analyzed, so as to increase the specificity of detection. In the present disclosure, an optimized combination of exosome biomarkers can be developed by calculating the overlapping index of a capture-detection (C-D) pair, that is, a protein used for capture (enrichment) and a biomarker used for detection. In an embodiment of the present disclosure, BIRC5 and PTK7 are used respectively as surface proteins for affinity purification, and ART3 is used as a biomarker for detection. First, the expression level distributions of the enrichment biomarkers (BIRC5 or PTK7) are used to simulate the redistribution of the proportions of exosomes from different tissue sources after the enrichment step. Then the probability density functions of the expression level distributions of the detection biomarker (ART3) in healthy and diseased people, and the associated overlapping index, are calculated.
In one embodiment of the present disclosure, the overlapping index of the biomarker combination is verified by evaluating its performance in the immunodetection of tumor cell exosomes added to plasma. 800 µL of plasma exosomes separated by size exclusion chromatography (SEC) is taken, and exosomes from MDA-MB-231 and MDA-MB-468, which are triple-negative breast cancer cell lines, are added so that the exosome concentrations are 1×10⁹ particles/mL for MDA-MB-231 and 3×10⁸ particles/mL for MDA-MB-468, respectively. Next, the exosome detection performance of two C-D pairs, BIRC5-ART3 and PTK7-ART3, is compared according to the above-mentioned magnetic bead-based immunoassay, wherein BIRC5 and PTK7 antibodies are used to capture exosomes, and ART3 antibodies are used as detection antibodies. Please refer to (B) of
This disclosure also provides a cancer detection method, using the circulating biomarker developed by the identification system of circulating biomarker for cancer detection described above. The circulating biomarkers include BIRC5 and ART3, which can be used to detect triple-negative breast cancer. For example, BIRC5 or ART3 antibodies are immobilized on carriers (such as magnetic beads or antibody-absorbable reaction disks) to capture exosomes in samples such as plasma, urine, and spinal fluid, and then antibodies which recognize BIRC5 or ART3 or other proteins are used for immunodetection. During the detection, enzymes such as horseradish peroxidase (HRP) and their substrates or fluorophore reagents can be used to generate signals for the detection of exosome biomarkers.
This disclosure also provides a kit, using the circulating biomarkers developed by the identification system of circulating biomarkers for cancer detection described above. The circulating biomarkers include BIRC5 and ART3, which can be used to detect triple-negative breast cancer. The kit contains a BIRC5 or ART3 antibody, or a solid support carrying a BIRC5 or ART3 antibody, such as magnetic beads or a reaction plate that can absorb antibodies. This antibody reagent may be combined with antibodies which recognize BIRC5, ART3 or other proteins for exosome detection, with or without reagents such as enzymes (including HRP) and their substrates, or fluorophores.
In summary, the present disclosure provides an identification system of circulating biomarkers for cancer detection, a development method of circulating biomarkers for cancer detection, a cancer detection method and a kit. Based on gene expression data of proteins and nucleic acids, the identification system and development method use the null hypothesis test, computational deconvolution, the overlapping index and other methods to identify genes whose expression level in tumor tissue is significantly higher than that in normal tissue, taking into account the fold change of the expression level in tumor tissue relative to normal tissue and other screening conditions. After that, the exosome expression level distributions of these genes in the blood of healthy and diseased people are simulated and combined with the exosome expression level distributions of healthy and diseased people after the enrichment step, so as to sort out the candidate protein and nucleic acid biomarkers, which are used as priority references for subsequent clinical specimen verification.
This application claims the priority benefit of U.S. Provisional Application Serial No. 63/294,359, filed on Dec. 28, 2021. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
Number | Date | Country
---|---|---
63294359 | Dec 2021 | US