The present invention relates to method(s) for measuring gene copy number and gene expression, quantitative PCR, qRT-PCR, normal individuals, medical conditions including the patients with cancer, ovarian cancer, ovarian serous adenocarcinoma, cancer diagnosis, cancer detection, therapy monitoring and laboratory diagnostics.
The gene copy number (also gene “copy number variants” or CNV) is the number of copies of a particular gene in the genotype of an individual. In the human genome, DNA encodes more than 25,000 protein coding genes and many thousands of non-protein coding genes. It was generally thought that genes in somatic cells were almost always present in two copies in a genome. However, recent discoveries have revealed that some segments of DNA can be present in larger numbers. The size of such segments ranges from hundreds to millions of DNA bases, providing variation in DNA segment/gene copy-number. Such differences in the CNV of individual genomes occur in normal body cells, contributing to the organism's uniqueness. However, these changes in DNA amount also influence most traits, including susceptibility to disease. CNV can encompass individual genes and their clusters, leading to dosage imbalances. For example, genes that were thought to always occur in two copies per genome have now been found to sometimes be present in one, three, or more than three copies. In various medical conditions and disease progression states, some DNA loci containing key regulatory genes are missing.
Gene or DNA copy number (CN) is usually measured by an average number of DNA copies per genome per cell in a biological sample. Gene copy number variation (CNV) is observed in normal tissue samples and is amplified in certain diseases, such as cancers. It has previously been demonstrated that CNV of a given gene directly affects its expression. The exact relationship between the CNV and the gene expression values is poorly studied but it is thought to be a nonlinear relationship which depends on cell, tissue, organism and medical conditions. The accurate and reproducible detection of CN and CNV of a given genome locus (or loci) and an establishment of their quantitative interconnection with the variation of expression of a gene belonging to a given CNV locus (or loci) is a great challenge. A practical solution of this problem is urgently needed for optimization of healthcare strategies, evaluation of the status of normal individuals and for diagnosis, prognosis and prediction for patients with medical conditions.
qPCR-based assays are considered “gold standards” for detecting a variety of medical conditions attributed to gene expression changes and are broadly used in common clinical practice. Gene expression levels in cell and/or tissue samples usually range over 5-6 orders of magnitude, and qPCR-based techniques detect variation in such characteristics, often with high accuracy. However, interpretation of qPCR-based assays depends largely on measurement of cycle threshold (CT) values of the target gene(s) relative to CT values of reference/normalizing gene(s) (e.g. ACTB, GAPDH etc.). This condition might be a limitation in the context of cell or tissue specification and of bio-medical or environmental conditions, due to a systematic or random error variation that could occur in the reference/normalizing gene(s). In particular, some of the reference/normalizing gene(s) can also vary in a correlated manner with expression levels of the gene(s) of interest in a given cell/tissue sample. For example, GAPDH, commonly used as a reference gene, is considered to be an oncogene in breast cancer as its expression level is highly correlated with cancer progression level. Therefore, this gene cannot be used as an invariant reference for breast cancer assays. The variation in expression levels of the reference/normalizing gene(s) could also be prone to non-specific and poorly controlled noise, due to the heterogeneous sample cell composition. Thus, in many cases conventional reference/normalizing gene(s) might not be usable as “universal” and “independent” controls providing robust, unbiased and accurate measurements of the expression of a given gene of interest estimated via CT value analysis calculations for a qPCR assay. Identifying adequate reference/normalizing gene(s) for the accurate, robust and reliable detection of the DNA copy number variation (CNV) of a given gene locus using qPCR-based assays appears to be even more challenging.
Firstly, the dynamic range of CNV detection is limited to a few delta-delta CT-values, which is a less accurate and more noise-prone measurement procedure than that of gene expression. Secondly, the actual measurement in a cell/tissue sample is defined by delta-delta CT-values, averaged across many cells of a biological sample. CNV of the “control” genes across a single sample can be observed even in normal tissue samples, and is much more amplified in some pathological cases. Thirdly, in certain diseases, such as serous ovarian carcinoma, CNV of a given gene might directly affect the gene expression. The exact relationship between the CNV and the expression values is poorly understood and might be non-linear. Present methods for measuring gene CN and expression have been designed ignoring these facts. Therefore, gene CN and expression values obtained with any existing measurement method are affected by the unobserved CNV. In such cases, the CNV of the reference gene set also affects the observed expression values of any other gene measured in a given assay. Thus, the problem of indefinite CNV may invalidate any gene expression measurement. In many situations, such as those indicated above, more accurate, unbiased and robust reference/normalizing gene(s) should be identified, and appropriate primers should be optimized for use in detecting gene expression (mRNA/ncRNA) and CN (DNA) level.
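The effect of an unobserved CNV in the reference gene on a delta-delta CT estimate can be illustrated with a short sketch. It assumes the commonly used 2^(-ddCT) relative quantification formula with ideal amplification efficiency (one doubling per cycle) and a calibrator sample carrying two copies of both target and reference. The function name and all CT values are hypothetical and are not taken from the present disclosure.

```python
import math

def relative_copy_number(ct_target, ct_ref, ct_target_cal, ct_ref_cal,
                         calibrator_cn=2.0):
    """Estimate target CN with the 2^(-ddCT) formula (assumes 100% efficiency)."""
    d_ct_sample = ct_target - ct_ref          # delta CT in the test sample
    d_ct_cal = ct_target_cal - ct_ref_cal     # delta CT in the calibrator
    dd_ct = d_ct_sample - d_ct_cal            # delta-delta CT
    return calibrator_cn * 2.0 ** (-dd_ct)

# Hypothetical calibrator: target and reference both at 2 copies, CT = 25.
# Hypothetical tumor sample: target at 4 copies, so its CT is one cycle lower.
cn_with_stable_ref = relative_copy_number(24.0, 25.0, 25.0, 25.0)

# Same sample, but the reference locus has gained a copy (3 instead of 2),
# which lowers the reference CT by log2(3/2) cycles.
ct_ref_gained = 25.0 - math.log2(3 / 2)
cn_with_varying_ref = relative_copy_number(24.0, ct_ref_gained, 25.0, 25.0)

print(cn_with_stable_ref)    # 4.0
print(cn_with_varying_ref)   # about 2.67, i.e. 4 copies scaled by 2/3
```

Under these assumptions, a single extra copy of the reference locus shrinks the apparent target CN from 4 to roughly 2.67, which is the kind of bias the paragraph above describes.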
Some embodiments relate to a method for determining a quantitative measure of a target gene in a biological sample from a subject, the method comprising:
Other embodiments relate to a kit for obtaining reference gene measurements in one or more biological samples, the kit comprising oligonucleotide primers capable of binding to and/or amplifying at least a portion of the nucleic acid sequence, and/or cDNA derived therefrom, of at least one gene selected from the group consisting of: XRCC5; AUTS2; EIF5; PARN; YEATS2; and FHL2.
According to a preferred embodiment of the kit, the primer sequences are selected from or derived from oligonucleotide sequences identified in Table 6 as SEQ ID Nos: 1-24.
According to a preferred embodiment of the kit, the primers are capable of binding to and/or amplifying at least a portion of the nucleic acid sequence, and/or cDNA derived therefrom, of at least one locus selected from Table 1, Table 2, Table 3, Table 4, Table 5, Table 8, Table 9, Table 10, Table 11, Table 13 or Table 14.
Further embodiments relate to a computer program or a computer device comprising a computer program which is capable of implementing the method according to any aspect of the present invention.
Further embodiments relate to a computer-implemented method for identifying reference genes and/or loci for relative quantitation of a target gene or locus, the method comprising:
Yet further embodiments relate to a computer-implemented method for identifying reference genes/loci for relative quantitation of a target gene/locus, the method comprising:
Yet further embodiments relate to a method for measuring target gene(s) DNA copy number in one or more samples, the method comprising:
Further embodiments relate to a system for identifying reference genes and/or loci for relative quantitation of a target gene or locus, the system comprising:
Yet further embodiments relate to a system for identifying reference genes/loci for relative quantitation of a target gene/locus, the system comprising:
Other embodiments relate to a non-transitory computer readable medium having program instructions stored thereon for causing at least one processor to carry out the method according to any of the above embodiments.
Embodiments of the present disclosure relate to a novel method for obtaining accurate CN and gene expression measures of a given gene of a given subject via normalizing the measured values onto CN of the proposed DNA sequences (rtPCR/qPCR) primers associated with one (or more) of the obtained reference genes selected by a reference gene identification method which works at the genome level across populations of individuals and diverse medical conditions.
In certain embodiments, specified DNA sequences of a reference gene set, along with loci coordinates of the respective primers, might be optimized for a given patho-biological context and medical conditions. The practical efficacy/power of embodiments of the method is demonstrated using epithelial ovarian cancer (EOC) samples. Embodiments propose a reference gene set previously never used as a reference or normalization control in qPCR-based assays. This set is proposed for use in detection of expression and DNA copy number variation in ovarian serous adenocarcinoma samples. Embodiments also provide a computational method allowing one to select “reference and normalization” genes for any sample set, sharing specific biological or pathological characteristics, such as tissue of origin or/and medical condition.
Some embodiments relate to an in vitro method for obtaining information on the number of DNA copies (CN) of a given locus of interest in a biological sample, the method comprising:
i) obtaining the CN value of the locus of interest in the biological sample;
ii) obtaining the CN value or values of one or more CN-invariant locus reference(s) (CNILR) in the biological sample, wherein the CNILR is defined as a locus which is locally CN-invariant, or as a locus with a minimal coefficient of variation value of its CN values across said group;
iii) obtaining the CN value or values of one or more CN-invariant survival-insignificant locus reference(s) (CNISILR), wherein the CNISILR is defined as a CNILR, whose CN value, or any expression value of the genes within the locus, cannot define more than one subgroup of said group, based on survival prediction analysis; and
iv) normalizing the CN value of the locus of interest by the CN value of said one or more CNISILRs if defined, otherwise normalizing the CN value of the locus of interest by the CN value of said one or more CNILRs.
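Step iv) above can be sketched as follows. This is a minimal illustration only: it assumes the normalization function is a simple ratio and that multiple reference values are summarized by their median, both of which are options mentioned elsewhere in this description rather than requirements. The function and parameter names are hypothetical.

```python
from statistics import median

def normalize_cn(target_cn, cnisilr_cn=None, cnilr_cn=None):
    """Normalize a target CN by reference CN values.

    CNISILR values are used when provided (non-empty); otherwise the
    CNILR values are used. The ratio-to-median choice is an assumption.
    """
    refs = cnisilr_cn if cnisilr_cn else cnilr_cn
    if not refs:
        raise ValueError("no reference CN values supplied")
    return target_cn / median(refs)

# CNISILRs defined: they take precedence over CNILRs.
print(normalize_cn(3.9, cnisilr_cn=[2.0, 2.1, 1.9], cnilr_cn=[2.4]))

# No CNISILR defined: fall back to the CNILR values.
print(normalize_cn(3.0, cnilr_cn=[2.0]))
```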
In a preferred embodiment, said one or more CNILRs in the biological sample is/are determined by:
i) providing a representative reference data set containing measurements of genome-wide CN variation with respect to a group of samples;
ii) identifying a set of loci with the lowest variation across the reference data set as the reference loci;
iii) ranking the reference loci by their median CN values across the reference data set; and
iv) selecting one locus or a set of loci with the highest median CN value(s) as the CNILR(s).
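The selection steps i) to iv) above can be sketched in a few lines. The sketch assumes that "lowest variation" is measured by the coefficient of variation (population standard deviation divided by mean) and that a fixed number of lowest-variation loci is retained; the locus names, CN values, and parameters `n_lowest` and `n_select` are hypothetical.

```python
from statistics import mean, median, pstdev

def select_cnilr(cn_by_locus, n_lowest=3, n_select=1):
    """cn_by_locus maps a locus name to its CN values across reference samples."""
    # Step ii: coefficient of variation per locus; keep the n_lowest loci.
    cv = {locus: pstdev(vals) / mean(vals) for locus, vals in cn_by_locus.items()}
    reference_loci = sorted(cv, key=cv.get)[:n_lowest]
    # Step iii: rank the reference loci by median CN, highest first.
    ranked = sorted(reference_loci,
                    key=lambda locus: median(cn_by_locus[locus]),
                    reverse=True)
    # Step iv: select the locus or loci with the highest median CN.
    return ranked[:n_select]

# Step i: a toy reference data set (four samples per locus, invented values).
reference_data = {
    "locus_A": [2.1, 2.1, 2.2, 2.0],
    "locus_B": [2.0, 3.5, 1.0, 4.0],   # highly variable, should be excluded
    "locus_C": [1.9, 1.9, 2.0, 1.9],
    "locus_D": [2.0, 2.0, 2.0, 2.1],
}

print(select_cnilr(reference_data, n_lowest=3, n_select=1))
```

In a real data set the cut-off for "lowest variation" could instead be a threshold on the coefficient of variation; the fixed count used here is only a convenience for the example.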
In another preferred embodiment, said one or more CNISILRs in the biological sample is/are determined by:
i) providing a representative reference data set containing measurements of genome-wide CN variation with respect to a group of samples;
ii) identifying a set of loci with the lowest variation across the reference data set as the reference loci;
iii) identifying a subset of loci, whose functions and/or transcriptional activity are not statistically associated in the reference data set, as loci with no significant statistical association;
iv) ranking the loci with no significant statistical association by the coefficients of variation of the expression values of the transcripts originating in these loci across the reference data set; and
v) selecting one locus or a set of loci with the lowest coefficient(s) of variation of the CN values as the CNISILRs.
The normalization may be conducted by normalizing the CN value of the locus of interest by the CN value of a CNISILR. Alternatively, or in addition, normalization is conducted by normalizing the CN value of the locus of interest by the median CN value of more than one CNISILR. Normalization may also be conducted by normalizing the CN value of the locus of interest by the CN value of one CNILR or by the median CN value of more than one CNILR.
According to a preferred embodiment, said one or more CNILRs or CNISILRs is one or more loci from the group consisting of: XRCC5; AUTS2; EIF5; PARN; YEATS2; and FHL2.
More particularly, said one or more CNILRs or CNISILRs is/are selected from the loci identified in Table 1, Table 2, Table 3, Table 4, Table 5, Table 8, Table 9, Table 10, Table 11, Table 13 or Table 14.
According to a preferred embodiment, said one or more CNILRs or CNISILRs is/are selected if the coefficient of variation is less than a computationally or empirically predetermined threshold, for example a threshold equal to 0.05.
Some embodiments relate to an in vitro method for determining the CN of a target gene in a biological sample, the method comprising:
Other embodiments relate to a method for determining the set of CN-invariant loci in a given set of samples, the method comprising:
Further embodiments relate to an in vitro method for determining the expression of a target gene in a biological sample, the method comprising:
The CN value of the locus of interest and/or of said reference locus or loci in the biological sample may be determined as a gene expression value originating from a transcript of said locus.
In a preferred embodiment of any aspect of the present invention, the sample is obtained from cells or tissues from cancer patients or cell cultures derived from cancer patients.
The cancer patients may have a cancer type or subtype selected from ovarian cancer, breast invasive carcinomas, head and neck squamous cell carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, prostate adenocarcinoma, colon adenocarcinoma, stomach adenocarcinoma, hepatocellular carcinoma, or cervical squamous cell carcinoma.
In a preferred embodiment, the sample is obtained from cells or tissues obtained from myocardial infarction patients or cell cultures derived from myocardial infarction patients.
Yet further embodiments relate to a method for determining the set of CN- and expression-invariant loci that can be used as references for target gene expression measurements, the method comprising:
Yet further embodiments relate to a method for determining the optimal range of gene expression values that can be measured using the CN- and expression-invariant genes as references.
Yet further embodiments relate to CN- and gene expression measurements in ovarian cancer samples.
The present invention is further defined in accordance with the claims appended hereto.
The present invention will now be further described by way of example and with reference to the Figures which show:
For convenience, certain terms employed in the specification and examples are collected here.
The term “aptamer” is herein defined to be an oligonucleotide or peptide molecule that binds to a specific target molecule. In particular, an aptamer used in the present invention may be generated using different technologies known in the art, which include but are not limited to systematic evolution of ligands by exponential enrichment (SELEX) and the like.
The term “comprising” is herein defined to be that where the various components, ingredients, or steps, can be conjointly employed in practicing the present invention. Accordingly, the term “comprising” encompasses the more restrictive terms “consisting essentially of” and “consisting of.” With the term “consisting essentially of” it is understood that the method according to any aspect of the present invention “substantially” comprises the indicated step as an “essential” element. Additional steps may be included.
The term “difference” between two groups of patients is herein defined to be the statistical significance (p-value) of a partitioning of the patients within the two groups. Thus, achieving a “maximum difference” means finding a partition of maximal statistical significance (i.e. minimal p-value).
The term “label” or “label containing moiety” refers to a moiety capable of detection, such as a radioactive isotope or group containing same and non-isotopic labels, such as enzymes, biotin, avidin, streptavidin, digoxygenin, luminescent agents, dyes, haptens, and the like. Luminescent agents, depending upon the source of exciting energy, can be classified as radio luminescent, chemiluminescent, bio luminescent, and photo luminescent (including fluorescent and phosphorescent). A probe described herein can be bound, for example, chemically bound to label-containing moieties or can be suitable to be so bound. The probe can be directly or indirectly labelled.
The term “locus” is herein defined to be a specific location of a gene or DNA sequence on a chromosome. A variant of the DNA sequence at a given locus is called an allele.
The term “copy number (CN) value” or “DNA copy number value” is herein defined to refer to the number of copies of at least one DNA segment (locus) in the genome. The genome comprises DNA segments that may range from a small segment, the size of a single base pair to a large chromosome segment covering more than one gene. This number may be used to measure DNA structural variations, such as insertions, deletions and inversions occurring in a given genomic segment in a cell or a group of cells. In particular, the CN value may be determined in a cell or a group of cells by several methods known in the art including but not limited to comparative genomic hybridization (CGH) microarray, qPCR, electrophoretic separation and the like. CN value may be used as a measure of the copy number of a given DNA segment in a genome. In a single cell, the CN value may be defined by discrete values (0, 1, 2, 3 etc.). In a group of cells it may be a continuous variable, for example, a measure of DNA fragment CN ranging around 2 plus/minus increment d (theoretically or empirically defined variations). This number may be larger than 2+d or smaller than 2-d in the cells with a gain or loss of the nucleotides in a given locus, respectively.
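The gain/loss rule described above for a group of cells can be written as a small classifier. The increment `d` is left open in the definition (theoretically or empirically defined), so the default of 0.3 used here is purely a hypothetical placeholder, as is the function name.

```python
def classify_cn(cn_value, d=0.3, normal_cn=2.0):
    """Classify an averaged CN value relative to the range normal_cn +/- d."""
    if cn_value > normal_cn + d:
        return "gain"
    if cn_value < normal_cn - d:
        return "loss"
    return "normal"

for value in (2.1, 3.0, 1.2):
    print(value, classify_cn(value))
```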
With respect to associations between disease and CN value, a level of variation (deviation) in a DNA segment CN might be important. A level of positive or negative increment of the CN from the normal dynamic range in a DNA sample of a given cell group or a single cell may be called CN variation.
The term “sample” is herein defined to include but not be limited to blood, sputum, saliva, mucosal scraping, tissue biopsy and the like. The sample may be an isolated cell sample which may refer to a single cell, multiple cells, more than one type of cell, cells from tissues, cells from organs and/or cells from tumors.
A person skilled in the art will appreciate that the present invention may be practiced without undue experimentation according to the method given herein. The methods, techniques and chemicals are as described in the references given or from protocols in standard biotechnology and molecular biology text books.
The method according to any aspect of the present invention may be in vitro, or in vivo. In particular, the method may be in vitro, where the steps are carried out on a sample isolated from the subject. The sample may be taken from a subject by any method known in the art. By way of non-limiting example, ovarian tumor material may be extracted from ovaries, fallopian tubes, uterus, vagina and the like. Metastatic tumor samples may be extracted from the peritoneal cavity, other body organs, tissues and the like. Cancer cells may be extracted from non-limiting examples such as biological fluids, which include but are not limited to peritoneal liquid, blood, lymph, urine, products of body secretion and the like.
The term “genomic object” here defines a physical element of a given genome. Examples of a genomic object include (but are not limited to) a chromosome, a chromosomal arm, a plasmid.
The term “locally CN-invariant gene/locus” here defines a gene/locus with the number of copies, averaged across the span of the genomic coordinates of said gene/locus, staying unchanged under any extension of the locus' span within the entire genomic object.
The term “CN-invariant genes/loci in pathological samples”, or pathologically CN-invariant, here defines the genes/loci with average two copies per genome in pathological samples. The pathological samples can be represented by HG-SOC samples. A set of such genes/loci is listed in Table 1.
The term “CN-invariant genes/loci in normal tissues”, or biologically CN-invariant, here defines the genes/loci with average two copies per genome in tissue samples obtained from healthy humans. These samples can be represented by the ones collected in the Thousand Genomes project, for example. A set of such genes/loci is listed in Table 2.
The term CN-invariant genes/loci in human genome here defines the genes/loci being CN-invariant in both pathological and normal tissue samples. A set of such genes/loci is listed in Table 3.
The terms ‘invariant’ and ‘lowest variance’ here are used interchangeably for any data (including, but not limited to gene expression and copy number measurements), where variation across sample groups is not detected.
The terms ‘gene’ and ‘locus’ may be used interchangeably in the cases when the gene expression measurements are uncertain or irrelevant, for example when it is desired to quantify copy number but not gene expression.
The term genomic partition here defines a locus that includes the genomic coordinates of more than one gene.
The term cytoband here defines a genomic region that can be revealed by a standard cytogenetic staining (such as Giemsa staining).
The term human reference genome here defines the sequence annotated as the reference by the Genome Reference Consortium [Church D M, et al., PLoS Biology 9: 1001091 (2011)].
The term “group of biological samples” is here defined as a collection of samples sharing one or more common biological or clinical property. Examples of such properties include (but are not limited to) tissue type, type of cells, source organism, the age of source organism, conditions of cellular growth, environmental conditions, treatment type.
The term normalization function here defines a function taking two arguments (the target and the reference), and returning one value. The function returns the scaling of the target in the units of the reference. The reference may be a single value or a set of values. An example of a normalization function is the ratio of the target value to the reference value. Standard score is an example of a normalization function, where the target is a single value, and the reference is a set: the standard score returns a scaling which is the ratio of the difference between the target value and the mean reference value to the standard deviation of the reference values.
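The two example normalization functions named above, the ratio and the standard score, can be sketched as follows. The function names are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption, since the definition does not specify which is meant.

```python
from statistics import mean, stdev

def ratio_normalization(target, reference):
    """Scale the target in units of a single reference value."""
    return target / reference

def standard_score(target, reference_values):
    """(target - mean of references) / standard deviation of references."""
    return (target - mean(reference_values)) / stdev(reference_values)

print(ratio_normalization(3.0, 2.0))          # 1.5
print(standard_score(5.0, [1.0, 2.0, 3.0]))   # 3.0
```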
The term normalization here defines a procedure of adjusting the values of the target measurement(s) by the values of the reference measurement(s), referred to as the normalization factor(s), using a normalization function. Typically, the normalization factor is the scaling returned by the normalization function.
The term reference gene here defines a gene that can be used as a normalization reference to obtain measurements of the target gene that would increase the measurements' accuracy upon the normalization.
The term reference locus (plural—loci), also referred to as locus reference, here defines the genomic coordinate range that can be used as a normalization reference(s) for measurements of the target locus or gene that would increase the measurements' accuracy upon normalization.
The term CN-invariant locus reference, also referred to as CNILR, in a given biological sample is here defined as a locus, which is locally CN-invariant; or in a biological sample representing a given group of biological samples the term CN-invariant locus reference is here defined as a locus with a minimal coefficient of variation value of its CN values across said group.
The term CN-invariant survival-insignificant locus reference(s) (CNISILR) in a biological sample representing a given group of biological samples, is defined as a CNILR, whose CN value, or any expression value of the genes within the locus, cannot define more than one subgroup of said group, based on survival prediction analysis.
The term numeric integrative measure here defines a function that takes a set of numeric values as an input and returns a single numeric value as an output. Examples of integrative measures are: mean, median, variance, maximum values.
The term robust measure is here defined as a measure, whose value does not significantly change if outliers are added to the measured data. Robustness of a measure may be defined for a specific measure compared to alternative measures of the same data (e.g. median vs. mean value estimation), or for a class of measures, compared to other classes of measures (e.g. a gene expression value measure with qPCR versus a gene expression microarray).
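The median versus mean comparison mentioned above can be made concrete with a toy example. The CN values below are invented for illustration; the point is only that one added outlier moves the mean substantially while leaving the median unchanged.

```python
from statistics import mean, median

values = [2.0, 2.1, 1.9, 2.0, 2.0]     # hypothetical CN measurements
with_outlier = values + [8.0]          # the same data plus one outlier

mean_shift = abs(mean(with_outlier) - mean(values))
median_shift = abs(median(with_outlier) - median(values))

print(mean_shift)     # close to 1.0
print(median_shift)   # 0.0
```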
The term disease status information is here defined as a qualitative or quantitative variable defined for a patient (or a healthy subject) respective to a given disease, e.g. diagnosis, survival status (living or deceased) over a fixed time period, risk group, type of response to therapy, time after first disease recurrence. The particular value of a disease status information variable is here defined as the disease status.
The term disease status-significant genes is here defined as genes that can stratify a cohort of patients into two or more groups by their given disease status with a given degree of statistical significance.
Most of the genes in the genomes of EOC tumors (TCGA) are affected by CNV (
The genomic regions unaffected by CNV typically spanned a few megabases. The 851 cytobands containing no CNV were selected as CN-invariant. The loci (obtained as the genomic coordinates of the longest transcription variants of the respective genes in the RefSeq database) affected by CNV were discarded, and 2841 unaffected genes were selected for further analysis. Among these genes, only 246 were located in the CN-invariant cytobands (listed in Table 1). Such genes were considered CN-invariant. These loci and genes could serve as references for CNV measurement in EOC tumor samples.
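The two-level filter described above (a gene must itself be unaffected by CNV and must lie in a CN-invariant cytoband) can be sketched as a simple set-based selection. The gene and cytoband identifiers below are placeholders, not entries from Table 1, and the function name is hypothetical.

```python
def cn_invariant_genes(gene_to_cytoband, cnv_affected_genes, invariant_cytobands):
    """Keep genes with no CNV that also lie in a CN-invariant cytoband."""
    return sorted(
        gene
        for gene, band in gene_to_cytoband.items()
        if gene not in cnv_affected_genes and band in invariant_cytobands
    )

# Placeholder annotation: gene -> cytoband.
gene_to_cytoband = {
    "GENE_A": "band_1",
    "GENE_B": "band_1",
    "GENE_C": "band_2",
    "GENE_D": "band_3",
}
cnv_affected_genes = {"GENE_B"}               # genes overlapping a CNV
invariant_cytobands = {"band_1", "band_3"}    # cytobands containing no CNV

print(cn_invariant_genes(gene_to_cytoband, cnv_affected_genes, invariant_cytobands))
```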
To find such CN-invariant genes, which could be used as reference genes for both CNV and gene expression measurements, their median expression value and variance had to be assessed. For 157 of these loci (listed in Tables 2 and 3) Affymetrix U133A probes measured the expression of genes located in their genomic coordinates. These genes were considered CN-invariant and were tested for their expression median magnitude and variance across two cohorts of EOC tumors (TCGA and GSE9899).
As an additional criterion of robustness, the genes were tested for the significance of their expression values for the survival of the patients, using the 1DDg method [Motakis E, et al., IEEE Eng Med Biol Mag 28: 58-66 (2009)]. Potentially, the CN and expression of survival-significant genes might change depending on the subgroup of the patients or treatment options, as the tumors expressing such genes might be subject to selection. For the TCGA data set 92 genes (whose expression was measured by 121 probesets) satisfied this criterion, while in the GSE9899 data the number of such genes was 82 (with 117 corresponding probesets). Among them, 48 genes (measured with 59 probesets) were insignificant for survival (P>0.05) in both data sets (Table 4).
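The final step above, keeping only genes that are survival-insignificant (P>0.05) in every cohort, can be sketched as an intersection of per-cohort filters. The sketch assumes the survival p-values have already been computed by some survival analysis; it does not implement the 1DDg method. Gene names, cohort labels, and p-values are invented.

```python
def survival_insignificant(p_values_by_cohort, alpha=0.05):
    """Return genes whose survival p-value exceeds alpha in every cohort."""
    per_cohort = [
        {gene for gene, p in p_values.items() if p > alpha}
        for p_values in p_values_by_cohort.values()
    ]
    if not per_cohort:
        return []
    return sorted(set.intersection(*per_cohort))

# Hypothetical precomputed p-values for two cohorts.
p_values_by_cohort = {
    "cohort_1": {"GENE_A": 0.40, "GENE_B": 0.01, "GENE_C": 0.20},
    "cohort_2": {"GENE_A": 0.30, "GENE_B": 0.50, "GENE_C": 0.03},
}

print(survival_insignificant(p_values_by_cohort))
```

Only GENE_A passes in this toy input, because GENE_B is significant in the first cohort and GENE_C in the second.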
Actin B (ACTB) is among the genes most widely used as a reference in gene expression measurements with qRT-PCR. However, in the samples where CNV is observed within ACTB, using it as a reference increases the variation in the observed values of the copy number and gene expression of assessed genes. The example indicates that in EOC samples all genes of the Actin family are characterized by strong CNV (
Genes like ACTB, most commonly used as references for gene expression in normal samples, cannot be used as such in EOC samples in the context of either gene expression or copy number measurements, due to their essential CNV. Instead, reference genes should first be selected on the criterion of minimal (or absent) CNV in the studied samples. A method implementing such selection is a part of the present invention. Only the genes with no CNV localized in cytobands with non-varying copy number are selected as CNV-invariant genes (
The processed DCHGV (A Deep Catalog of Human Genetic Variation, 1000 Genomes Project) [Abecasis G R, et al., Nature 467: 1061-1073 (2010); Mills R E, et al., Nature 470: 59-65 (2011)] data set containing 89076 frequent gain/loss genomic aberrations in 19354 genes across 1062 samples was used in the analysis. Genes located in CN-invariant cytobands (i.e. cytobands containing no genomic gains or losses) in EOC tumors (TCGA) were filtered through the list of genes with aberrations obtained from the DCHGV. The 41 genes found to be CN-invariant in the TCGA EOC samples, and whose CN at the same time seldom changed across the 1062 samples of normal human tissues, were considered CN-stable.
To validate the genes selected as CN-invariant in EOC tumors along with the algorithms for selection of such genes, the copy number and expression of a selected set of genes were measured with qRT-PCR in EOC tumors and normal tissues. The list of targets for validation included three genes most often used as expression references for qPCR experiments (ACTB, TBP, and GAPDH) and six genes obtained by using the algorithms described here (XRCC5, AUTS2, EIF5, FHL2, PARN, and YEATS2).
Two sets of primers were designed to detect the amplification of each of these genes in the qPCR reactions measuring either the CN or the expression values (Table 6). For further analyses primer set 2 was used. The primer melting curves demonstrate that all the primers have a single region of annealing in the human genome. Except for XRCC5, each primer pair demonstrates a single melting temperature within the 75 to 90 degrees Celsius range (
To find whether any of the traditional gene expression reference genes (ACTB, GAPDH, and TBP) could also serve as references for gene CN measurements, their CN distribution was evaluated across EOC tumor samples (TCGA cohort). The results demonstrate that CNV in these genes occurs in 20 to 100 percent of tumors, with GAPDH tending to be amplified and TBP tending to be deleted (
To assess the effect of the reference genes, the CN of the MECOM locus (one of the most frequently amplified in EOC) was normalized by the CN of the reference genes. This would emulate some aspects of a CN measurement with a qPCR-based technique, where the CT values of the target gene are normalized by the CT values of the reference gene (
The ten most common cancers (Table 7), whose combined frequency accounts for 59% of all cancer cases worldwide, were selected for cross-validation of the loci serving as potential references for the Therascreen EGFR EGQ PCR kit (Qiagen). The six candidate reference loci proposed for ovarian cancer (see Table 6) were compared against ACTB, TBP, and GAPDH as potential normalization controls for the EGFR gene CN measurement (
The ten most common cancers (Table 7), whose combined frequency accounts for 59% of all cancer cases worldwide, were selected for cross-validation of the loci serving as potential references for the Human Breast Cancer PCR array (Qiagen). The six candidate reference loci proposed for ovarian cancer (see Table 6) were compared against ACTB, TBP, and GAPDH as potential normalization controls for the CN measurements of the 23 diagnostic array loci (Table 12). Across the breast invasive carcinoma (
For the lung adenocarcinoma (
Across the breast invasive carcinoma (
An embodiment of the proposed method has been applied to select the candidate loci that could serve as common references for the ten most frequent cancers (Table 7) as follows. First, the loci with the lowest CN variation across the samples of each of the ten cancers (
An embodiment of the proposed method has been applied to select the candidate loci that could serve as common references for tissues from healthy subjects, patients with non-cancerous disease, and cancer-unaffected tissues obtained from cancer patients. The healthy subjects were represented by the 1000 genomes of DCHGV cohort [Abecasis G R, et al., Nature 467: 1061-1073 (2010); Mills R E, et al., Nature 470: 59-65 (2011)] obtained from various tissues. The genomes of the non-cancerous patients were represented by the blood samples of 31 myocardial infarction patients (data set GSE31276).
To assess the CNV in the genomes of the 5290 patients, affected by the 10 most frequent cancers (listed in Table 7), genomic data of Level 3 (as defined by the TCGA data processing methods) was obtained. Each patient was characterized with the genomic data obtained from a pair of a blood sample and a tumor sample. Analyses of the tumor samples of these patients are presented in the Examples 7-9 (the TCGA cohort).
The blood samples of these patients were considered as cancer-unaffected, along with the samples from the DCHGV and the GSE31276 cohorts. Our analysis demonstrated that the total number of loci with the lowest, effectively zero, CN variation was 8300, 1231, and 16 in the DCHGV, the GSE31276, and the TCGA cohorts, respectively (Table 9;
In the intersections of these three sets, cross-cohort sources of reference loci were identified. A total of 637 loci that revealed the lowest variance across both the DCHGV and the myocardial infarction patients' blood genomes were considered as reference control candidates for non-cancerous genomes (Table 10).
These loci (Table 11) are the most stable across normal subjects, non-cancerous disease subjects, and cancer-unaffected tissues of cancer patients. They are regarded as candidate reference loci for CN normalization across all non-cancerous subjects.
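The cross-cohort selection described above amounts to intersecting per-cohort sets of low-variation loci. The following is a minimal sketch of that idea only; the locus identifiers and the function name `cross_cohort_loci` are hypothetical placeholders, not the actual loci of Tables 9 to 11.

```python
def cross_cohort_loci(*cohort_sets):
    """Return the loci present in every supplied cohort set.

    Each argument is an iterable of locus identifiers that showed the
    lowest (effectively zero) CN variation within one cohort.
    At least one cohort set must be supplied.
    """
    return set.intersection(*map(set, cohort_sets))


# Hypothetical low-variation loci per cohort (illustrative identifiers only).
dchgv_loci = {"locus_1", "locus_2", "locus_3", "locus_4"}
gse31276_loci = {"locus_2", "locus_3", "locus_5"}
tcga_blood_loci = {"locus_3", "locus_6"}

# Candidates for non-cancerous genomes: stable in both non-cancer cohorts.
non_cancer_refs = cross_cohort_loci(dchgv_loci, gse31276_loci)

# Candidates stable across all three cancer-unaffected cohorts.
all_cohort_refs = cross_cohort_loci(dchgv_loci, gse31276_loci, tcga_blood_loci)

print(sorted(non_cancer_refs))
print(sorted(all_cohort_refs))
```

Adding a cohort can only shrink the intersection, which is consistent with the three-cohort candidate list being much shorter than the two-cohort one.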
Altogether, the cohort-specific and cross-cohort reference loci might be applied to study naturally occurring DNA copy number variations in the blood. These variations might be population-specific and reveal markers of various disease predispositions.
The present invention developed from work on DNA quantification with qPCR. The quantification procedure requires knowledge of both the target locus (or gene) of interest and the locus (or gene) of reference. The DNA of the target locus is quantified by the difference between the PCR amplification cycle counts of the target gene and the reference gene. The main assumption of the method is that for the reference gene the DNA copy number (and hence the PCR amplification cycle count) remains the same for all samples, including the tested and the control ones. In our work we found that this assumption does not hold true, at least for cancer samples. Since the cancer genome is highly mobile, and its evolution is unpredictable, any gene in the genome can be either amplified or deleted in a large number of cells comprising the cancer cell population. We experimentally observed that this amplification results in highly varying DNA copy numbers of the traditional qPCR reference loci, ACTB and GAPDH. Therefore, we experimentally confirmed that the above assumption is invalid. Moreover, since the RNA level of a gene is a product of the DNA of the same gene (with a non-linear dependence of the former on the latter), the validity of any universal standard loci for RNA quantification is also compromised.
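The dependence of the result on the reference locus can be illustrated with the widely used comparative CT (delta-delta-CT) calculation. This is a sketch under stated assumptions, not the procedure claimed here: it assumes near-perfect amplification efficiency (product doubling every cycle), a calibrator sample with two copies of both loci, and hypothetical CT values.

```python
def delta_ct(ct_target, ct_reference):
    """Difference between target and reference amplification cycle counts."""
    return ct_target - ct_reference


def relative_copy_number(ct_target, ct_reference,
                         ct_target_cal, ct_reference_cal,
                         calibrator_copies=2.0, efficiency=2.0):
    """Estimate target copy number relative to a calibrator sample.

    Assumptions (not from the source text): the product multiplies by
    `efficiency` each cycle, and the calibrator carries
    `calibrator_copies` copies of the target locus.
    """
    ddct = (delta_ct(ct_target, ct_reference)
            - delta_ct(ct_target_cal, ct_reference_cal))
    return calibrator_copies * efficiency ** (-ddct)


# Hypothetical tumor sample: target reaches threshold one cycle earlier
# than in the calibrator, while the reference locus is unchanged.
stable_ref = relative_copy_number(24.0, 25.0, 25.0, 25.0)

# Same target signal, but the reference locus is itself amplified in the
# tumor (its CT also drops by one cycle), which masks the target gain.
amplified_ref = relative_copy_number(24.0, 24.0, 25.0, 25.0)

print(stable_ref, amplified_ref)
```

With a stable reference the sketch reports four copies of the target; with a co-amplified reference it reports two, which is the kind of bias the paragraph above attributes to copy number changes in ACTB and GAPDH.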
To select a locus suitable as a qPCR reference, we proposed to discard the assumption of a universal reference, and developed procedures that would identify the best reference for a given multitude of samples. For example, the multitude may be defined as ovarian cancer samples (such as in Examples 1, 2, and 3). We define the best reference locus (or gene) as a locus whose DNA copy number value, as measured in a given qPCR setup, simultaneously satisfies two or more of the following conditions: 1) it has the smallest variation across all the samples (the specificity criterion), 2) it can be detected in all the samples, and/or 3) it does not evolve with time or as a result of changes in environmental conditions (e.g. disease treatments). In patients, the third condition can be ensured by neutrality of the gene's copy number and expression to the patient survival. Thus, the definition of the best reference set dictates the criteria for an unbiased selection of the reference genes. We implemented a computational pipeline (
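The first two selection conditions can be sketched as a simple ranking. This is an illustration only and not the computational pipeline referred to above: the locus names and copy number values are hypothetical, variation is summarized here by the coefficient of variation (an assumed choice), and the third condition (neutrality to survival) is not modeled.

```python
from statistics import mean, pstdev


def rank_reference_loci(copy_numbers):
    """Rank candidate loci from most to least stable.

    `copy_numbers` maps a locus name to its per-sample CN values, with
    None marking a sample in which the locus was not detected.
    Loci undetected in any sample are excluded (condition 2); the rest
    are ordered by coefficient of variation (condition 1).
    CN values are assumed to be positive.
    """
    ranked = []
    for locus, values in copy_numbers.items():
        if any(v is None for v in values):
            continue  # fails the "detected in all samples" condition
        cv = pstdev(values) / mean(values)
        ranked.append((cv, locus))
    ranked.sort()
    return [locus for _, locus in ranked]


# Hypothetical measurements across four samples.
example = {
    "locus_A": [2.0, 2.1, 1.9, 2.0],   # stable
    "locus_B": [2.0, 3.5, 1.2, 4.0],   # highly variable
    "locus_C": [2.0, None, 2.0, 2.0],  # missing in one sample
}

ranking = rank_reference_loci(example)
print(ranking)
```

The first entry of the ranking is the candidate reference for this particular set of samples; a different multitude of samples may yield a different ranking.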
We carried out an experimental study to check whether the currently most popular control loci (ACTB, GAPDH, TBP) satisfy the above conditions and how they compare to the list (Table 1) obtained with our unbiased selection method (see Example 5). We confirmed that: 1) the universal reference assumption does not hold true, since both ACTB and GAPDH reveal DNA copy number variation (
To further predict the performance of our tests for the cases of other cancers and non-cancerous diseases, we carried out a computational study using publicly available high-throughput data obtained from patients diagnosed with the ten most common cancer types (Examples 7, 8, and 9), myocardial infarction (Example 10), and a selection of healthy DNA donors from multiple populations across the world (Example 10). We also demonstrated how application of our method can reduce the variability of the measurements obtained with two popular in-vitro diagnostic tests (Examples 7 and 8;
The publicly available Affymetrix SNP-6.0 microarray data (described in the Clinical data section) was retrieved from the Gene Expression Omnibus (GEO) repository. Each data set was independently normalized using the following steps:
The initial data analysis was carried out with publicly available datasets: TCGA (The Cancer Genome Atlas) [Bell D, et al., Nature 474: 609-15 (2011)], GSE9899 [Tothill R W, et al., Clin Cancer Res 14: 5198-5208 (2008)], and DCHGV (A Deep Catalog of Human Genetic Variation, 1000 Genomes Project) [Abecasis G R, et al., Nature 467: 1061-1073 (2010); Mills R E, et al., Nature 470: 59-65 (2011)].
The National Institutes of Health (NIH) Cancer Genome Atlas (TCGA) data set with 514 EOC patients was used for the analysis of CNV, gene expression and patient survival [Bell D, et al., Nature 474: 609-15 (2011)]. The patients whose EOC tumors had the EVI1 gene amplified (average EVI1 gene copy number not less than 2.5 per cell), defined here as the ‘EVI1 amplified group’, were analyzed separately. The 5-year survival for this group of patients was 36 percent. The 5-year survival of the whole patient cohort was 28 percent. The 2-year survival of the whole patient cohort was 74 percent. Gene expression was measured with Affymetrix U133-A microarrays. Copy number was measured with Affymetrix SNP-6.0 CGH microarrays.
Gene Expression Omnibus (NIH) repository was used to obtain the GSE9899 (accession number) data set containing 246 samples [Tothill R W, et al., Clin Cancer Res 14: 5198-5208 (2008)]. From this set 16 patients were removed after a quality control assessment. The 5-year survival of the whole patient cohort was 44 percent. The 2-year survival of the whole patient cohort was 57 percent. Gene expression was measured with Affymetrix U133-Plus-2.0 microarrays.
A Deep Catalog of Human Genetic Variation (DCHGV) was used to obtain data on 202430 natural variations in the human genome reported in 10692 normal human tissue samples. Only variations reported as genomic gains or losses in more than 10 samples at frequencies more than 10% were included in the analysis. In total, 89076 genetic variations were selected, including 24891 cases of genomic gains and 64185 losses in 19354 genes, across 10692 biological samples.
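The inclusion rule for DCHGV variations can be sketched as a filter followed by a tally of gains and losses. This is a minimal illustration under assumptions: the record fields (`id`, `type`, `n_samples`, `frequency`) and the example records are hypothetical and do not reflect the actual DCHGV file format.

```python
from collections import Counter


def select_variations(variations, min_samples=10, min_frequency=0.10):
    """Keep variations reported in more than `min_samples` samples
    at a frequency above `min_frequency` (both thresholds exclusive)."""
    return [
        v for v in variations
        if v["n_samples"] > min_samples and v["frequency"] > min_frequency
    ]


# Hypothetical variation records.
records = [
    {"id": "v1", "type": "gain", "n_samples": 25, "frequency": 0.12},
    {"id": "v2", "type": "loss", "n_samples": 8,  "frequency": 0.30},
    {"id": "v3", "type": "loss", "n_samples": 40, "frequency": 0.05},
    {"id": "v4", "type": "loss", "n_samples": 11, "frequency": 0.11},
]

selected = select_variations(records)
counts = Counter(v["type"] for v in selected)

print([v["id"] for v in selected], dict(counts))
```

Both thresholds are applied jointly, so a variation seen in many samples but at low frequency (such as `v3`) is excluded, as is a frequent variation seen in too few samples (such as `v2`).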
Gene Expression Omnibus (NIH) repository was used to obtain the GSE31276 data set containing 31 individual genome profiles obtained from the blood of myocardial infarction patients. The samples were collected according to the Prospective Cardiovascular Munster study [Assmann G and Schulte H American heart journal 116: 1713-24 (1988)] and Framingham Heart study [Benjamin E J, et al., Circulation 98: 946-52 (1998)].
For validation experiments 48 DNA samples and 80 RNA samples purchased from Origene were used. The 48 DNA samples were extracted from individual serous ovarian adenocarcinoma tumors obtained from: 4 patients with the disease at stage 1, 3 patients at stage 2, 34 patients at stage 3, and 2 patients at stage 4. The 80 RNA samples were extracted from 7 normal fallopian tubes, 21 normal ovaries, and 52 individual serous ovarian adenocarcinoma tumors. The tumors were obtained from 11 patients with the disease at stage 1, 7 patients at stage 2, 29 patients at stage 3, and 5 patients at stage 4. For all 80 RNA samples the cDNA was synthesized using QuantiTect Reverse Transcription Kit 200 (Qiagen; cat. no: 205313).
Number | Date | Country | Kind |
---|---|---|---|
10201502276Q | Mar 2015 | SG | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/SG2016/050140 | 3/24/2016 | WO | 00 |