The invention relates to the field of cancer biomarkers, and a process for their identification and use.
The more one knows about a cancer, the more effectively it can be treated. For example, most cancer patients have surgery. However, additional benefits may be possible with additional treatment for some patients. There is not currently a satisfactory approach to determine which patients with cancer would benefit from extra therapy (such as chemotherapy) after surgery. The identification of genes and proteins specific to cancer cells that can be used for prognostic purposes would be helpful in this regard. These genes/proteins which identify tumours associated with a poor prognosis for recovery if treated only by surgery followed by typical standard of care are called poor prognostic biomarkers. These biomarkers can be used as valuable tools for predicting survival after a diagnosis of cancer, for identifying patients for whom the risk of recurrence is sufficiently low that the patient is likely to progress as well or better in the absence of post-surgery chemotherapy and/or radiation treatment or with only typical standard of care treatment post-surgery, and for guiding how oncologists should treat the cancer to obtain the best outcome.
Similarly, there are genes expressed in cancers which play a role in drug response. It would be useful to have information on predicted drug response when making clinical decisions.
A screening tool, to have sufficient precision to be of clinical interest, should preferably consider multiple markers for a type of cancer. A single gene marker does not provide a sufficient level of specificity and sensitivity. By way of example, microarray technology, which can measure more than 25,000 genes at the same time, provides a useful tool to find multi-markers.
It is an object of the invention to provide sets of markers for use in identifying tumour characteristics of interest and a process for their identification and use.
The present invention in one embodiment teaches the usage of gene expression profiles to distinguish ‘good’ and ‘bad’ tumours based on groups of genes. As used herein when referring to predictors and patient survival, the term “good tumour” refers to a tumour which is likely to be cured by surgery and only typical standard of care, without chemotherapy or radiation treatment (even if this is part of the typical standard of care). As used herein, the term “bad tumour” refers to a tumour which is not likely to be cured by surgery and only typical standard of care including chemotherapy or radiation treatment. As used herein, a tumour is “cured” if the patient has not experienced a recurrence of the tumour (or a metastasis of it) within 5 or 10 years of surgery.
It is possible to identify sets of genes whose expression profiles are able to distinguish ‘good’ and ‘bad’ tumours. The prior art discloses five such gene expression signal sets and these have been developed as biomarkers for breast cancer samples. Each gene expression signal set was derived from a set of breast tumour samples. However, these five biomarker sets cannot be cross-used. Specifically, the prior art so-called “breast cancer biomarkers” have not been found to be consistently predictive of prognosis when used in another set of breast tumour samples. Biomarkers for other types of cancers have the same problem. Cancer is highly heterogeneous. Frequently, several subtypes can be found for a given type of cancer. Previously disclosed marker sets are not universal enough for these subtypes.
To overcome these problems and the limitation of dataset (sample) availability, a new approach to finding and using sets of biomarkers was developed.
In one embodiment of the invention, random training datasets were generated from a published cancer dataset, in which gene expression profiles and clinical information of the patients had been included, to find robust sets of biomarkers. Gene expression profiles of the random training datasets were correlated with patient survival status and used to screen for biomarkers.
In one embodiment of the invention there is provided a method of identifying biomarkers, said method comprising:
A “gene expression signal” is a tangible indicator of expression of a gene, such as mRNA or protein.
In an embodiment of the invention there is provided a process to identify tumour characteristics, said process comprising the following steps:
In some cases, the characteristic of concern relates to one or more of: metastasis, inflammation, cell cycle, immunological response genes, drug resistance genes, and multi-drug resistance genes. In some cases the tumour characteristic is responsiveness to a particular treatment or combination of treatments.
In some cases the tumour characteristic is a tendency to lead to poor patient survival post-surgery.
In some cases, the tumour characteristic is related to patient survival and step 4 of the process above comprises assigning a value to the extracted gene expression signals according to the following rankings:
In cases where the cancer has more than one subtype, it may be desirable to include the preliminary steps of:
In some cases, the tumour characteristic of interest is the tendency of the tumour to respond to particular treatments, such as chemotherapeutic agents or radiation. In such a case, the gene expression signals are correlated with tumour drug response in the process of developing the training sets. It will be understood that a “good” tumour response to a particular drug would be below-average tumour survival following treatment and a “bad” response would be above-average tumour survival following treatment. Using this approach, and depending on the detail available in the original tumour and clinical data used in developing the training sets, it is possible to develop markers not only for response to individual drugs or treatments, but to combinations of treatments (where there is sufficient data in the original source to permit this).
In an embodiment of the invention there is provided a process for determining predictive gene expression signal sets of the type useful in the processes described above comprising the following steps:
In one embodiment of the invention there is provided a process of identifying patients in need of more or less aggressive treatment than the typical standard of care, said process comprising:
In some cases, for this process it will be desirable to group the selected identified gene expression signals according to their role in biological process using Gene Ontology analysis.
Preferably between 30 and 50 random training sets are created. More preferably, between 30 and 40 training sets are created.
It will sometimes be desirable to select the genes known to be active in cancer from the groups of genes responsible for metastasis, cell proliferation, tumour vascularisation, and drug response.
In some embodiments of the invention involving the process described above, in step 7, between about 750,000 and 1,250,000, or between about 900,000 and 1,100,000 or about a million random gene expression signal sets are generated. In some embodiments of the invention as described in the process above, in step 7, the random gene expression signal sets generated contain between about 25 and 50, or 28-32 or about 30 genes.
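By way of illustration only, the generation of random gene expression signal sets described above might be sketched as follows. The candidate pool, the gene names, the seed, and the function name `random_gene_sets` are hypothetical and are not part of the disclosure; the sketch draws far fewer sets than the approximately one million contemplated above so that it runs quickly.

```python
import random

def random_gene_sets(candidate_genes, n_sets, set_size=30, seed=0):
    """Draw n_sets random sets, each containing set_size distinct genes.

    candidate_genes: list of gene identifiers (hypothetical pool).
    set_size=30 follows the "about 30 genes" figure given in the text.
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return [rng.sample(candidate_genes, set_size) for _ in range(n_sets)]

# Hypothetical candidate pool of 500 placeholder gene names.
candidates = [f"GENE_{i}" for i in range(500)]

# The text contemplates about a million sets; 1000 are drawn here for brevity.
gene_sets = random_gene_sets(candidates, n_sets=1000)
```

Sampling without replacement within a set is an assumption of this sketch; it simply ensures that no gene appears twice in the same set.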
In an embodiment of the invention as described in the process above, in step 12 the top 26-50, or 28-32 or about 30 genes are selected.
In some cases when considering tumour characteristics relating to patient survival, it will be desirable to employ at least one cancer biomarker set selected from the list consisting essentially of NRC-1, NRC-2, NRC-3, NRC-4, NRC-5, NRC-6, NRC-7, NRC-8, and NRC-9.
In an embodiment of the invention there is provided a kit comprising at least three marker sets and instructions to carry out the process described above in order to identify a tumour characteristic of interest. In some cases, the kit will comprise at least 10 gene expression signals listed in Table 1A or 1B. In some cases, the kit will comprise at least 30 nucleic acid biomarkers identified according to the process described above.
In an embodiment of the invention there is provided the use of any of the gene expression signals in Table 1A or 1B in identifying one or more tumour characteristics of interest. In some cases, at least three different marker sets are used, in some cases with at least 1, 2, or 3 of the marker sets including at least 1, 5, 10, 20, or 25 of the gene expression signals found in Table 1A or 1B. In some cases each marker set contains at least 1, 5, 10, 20 or 25 of the gene expression signals found in Table 1A or 1B.
In an embodiment of the invention, the cancer biomarkers are breast cancer biomarkers and the first subtype of sample is an ER+ sample.
In an embodiment of the invention, in the process described above, the random training sets are generated by randomly picking samples while maintaining the same ratio of “good” and “bad” tumours as that in the set from which they are chosen.
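By way of illustration only, random training sets that preserve the ratio of "good" to "bad" tumours might be generated as sketched below. The sample identifiers, the 60/40 class split, the sampling fraction of one half, and the function names are assumptions made for the sketch; the text specifies only that the ratio is maintained and that preferably 30 to 50 sets are created.

```python
import random

def stratified_training_set(labels, fraction, rng):
    """Pick the same fraction of samples from each class.

    labels: dict mapping sample id -> "good" or "bad".
    Sampling each class separately keeps the good:bad ratio of the source set.
    """
    chosen = []
    for cls in ("good", "bad"):
        ids = sorted(s for s, lab in labels.items() if lab == cls)
        k = round(len(ids) * fraction)
        chosen.extend(rng.sample(ids, k))
    return chosen

def make_training_sets(labels, n_sets=30, fraction=0.5, seed=0):
    """Create n_sets random training sets (30 is the lower preferred bound)."""
    rng = random.Random(seed)
    return [stratified_training_set(labels, fraction, rng) for _ in range(n_sets)]

# Hypothetical source set: 60 "good" and 40 "bad" tumours.
labels = {f"T{i}": ("good" if i < 60 else "bad") for i in range(100)}
training_sets = make_training_sets(labels)
```

In this sketch each training set contains 30 "good" and 20 "bad" samples, which is the same 3:2 ratio as the source set.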
In some cases, the tumour characteristic(s) of interest will relate to patient survival (for example, following surgery and typical standard of care) and in such cases, the method may be used to identify patients in need of more or less aggressive treatment than the typical standard of care. (Chemotherapy and radiation treatment are, in themselves, hazardous. Thus, it is best to avoid providing such treatment to patients who do not need it.)
In some cases, it will be desirable to study tumour tissue for a patient by extracting gene expression signals (e.g. mRNA, protein) and assaying the presence (and in some cases level) of gene expression signals of interest using a reporter specific for the gene expression signal of interest. This may be done in a micro-array format permitting examination of multiple gene expression signals essentially simultaneously. A reporter may be a probe which binds to a nucleic acid sequence of interest, an antibody specific to a protein of interest, or any other such material (many such reporters are known in the art and used routinely). The reporter effects a change in the sample permitting assessment of the gene expression signal of interest. In some cases the change effected may be a change in an optical aspect of the sample, in other cases the change may be a change in another assayable aspect of the sample such as its radioactive or fluorescent properties.
In situations where a particular type of cancer has more than one subtype (e.g. ER+ and ER− breast cancers), it will be preferable to classify the patient's cancer by subtype initially, and then use markers developed in relation to that subtype.
In some cases, the tumour characteristic(s) of interest will relate to tumour response to particular treatment(s) and in such cases, the method may be used to identify promising treatment approaches (one or more chemotherapeutics or combinations of treatments) for the patient having the tumour.
As used herein “tumour” includes any cancer cell which it is desirable to destroy or neutralize in a patient. For example, it may include cancer cells found in solid tumours, myelomas, lymphomas and leukemias.
Tumours will generally be mammalian or bird tumours and may be tumours of: human, ape, cat, dog, pig, cattle, sheep, goat, rabbit, mouse, rat, guinea pig, hamster, gerbil, chicken, duck, or goose.
It will be apparent that the combinatorial use of three independent sets of gene expression signals is not limited to gene expression signals produced according to the approach described herein, but may also be applied to cancer biomarker datasets sold commercially or reported in the literature. (The reliability of the final screening result will, however, depend to some extent on the robustness of the sets used, and it is therefore recommended to use cancer biomarker datasets which are robust.) In some instances it will be desirable to select cancer biomarker datasets comprising genes involved in different biological processes (e.g. one dataset might relate to inflammation, another to cell cycle and the third to metastasis).
The process is general and may be applied to any type of cancer. For example it is useful in relation to those cancer types listed in Table 4.
In an embodiment of the invention, the process is applied to determine how aggressively a breast cancer patient should be treated post-surgery.
One embodiment of the process is provided below, in parallel with a description of Example 1:
In Example 1, another 3 sets of markers (called NRC-7, -8 and -9, respectively) were obtained. Each set contains 30 genes (see Table 1). These sets were used for ER− samples.
In Example 1, for each marker set, nearest shrunken centroid classification and leave-one-out methods were employed. We then used the 3 marker sets in combination to predict the recurrence of each sample.
For a given dataset, which contains n samples, the test process used in Example 1 was the following (step by step):
To predict the recurrence of the targeted testing sample using the marker set, we compare the modified gene expression profile of the sample to each of the modified class centroids. The class whose centroid is closest to the sample, in squared distance, is the predicted class for that sample. If the sample is predicted to be a “good” tumour, it is denoted as 0; otherwise, it is denoted as 1.
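By way of illustration only, the classification rule described above might be sketched as follows. This is a simplified sketch that uses plain class centroids; the shrinkage step that produces the "modified" centroids and profiles in nearest shrunken centroid classification is omitted. The expression values, the function names, and the tie-breaking rule (a tie is assigned to "good") are assumptions made for the sketch.

```python
def centroid(profiles):
    """Mean expression per gene across a list of equal-length profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(profile, centroids):
    """Return 0 ("good") or 1 ("bad") according to the nearest class centroid."""
    d_good = squared_distance(profile, centroids["good"])
    d_bad = squared_distance(profile, centroids["bad"])
    return 0 if d_good <= d_bad else 1  # tie goes to "good" (assumption)

def leave_one_out(profiles, labels):
    """Predict each sample from centroids built on all the other samples."""
    predictions = []
    for i, held_out in enumerate(profiles):
        rest = [(p, l) for j, (p, l) in enumerate(zip(profiles, labels)) if j != i]
        cents = {c: centroid([p for p, l in rest if l == c]) for c in ("good", "bad")}
        predictions.append(predict(held_out, cents))
    return predictions

# Hypothetical two-gene expression profiles for six samples.
profiles = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [3.0, 3.2], [2.9, 3.1], [3.2, 2.8]]
labels = ["good", "good", "good", "bad", "bad", "bad"]
predictions = leave_one_out(profiles, labels)
```

Each held-out sample is excluded from the centroids used to classify it, which is what distinguishes the leave-one-out estimate from a simple resubstitution estimate.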
To test the robustness and predictive accuracy of the marker sets, we tested the marker sets in three independent breast cancer datasets from these publications (Koe et al., Cancer Cell, 2006; Chang et al., PNAS 102:3738, 2005 and Sotiriou C, et al., J. Natl Cancer Inst, 98:262, 2006). In total, 644 samples were tested.
For ER+ samples, in each dataset, we first used NRC-1, -2 and -3 marker sets (from the three breast cancer datasets mentioned above) to stratify the samples into low (LG), intermediate (MG) and high (HG)-risk groups. If the high-risk group had fewer than 10 samples, we merged the MG and HG groups and called the result the intermediate-risk group. Otherwise, we used NRC-4, -5 and -6 marker sets to stratify the HG group into three new groups: low (NLG), intermediate (NMG) and high (NHG)-risk groups. We merged NLG and MG and called the result the intermediate-risk group, and merged NMG and NHG and called the result the high-risk group. The LG is the low-risk group. We obtained very good results with high predictive accuracy (~90% for non-recurrence patients) for the low-risk group and clearly separated the three groups in all 3 testing datasets (see Table 2).
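By way of illustration only, the group-merging logic described above for ER+ samples might be sketched as follows. The sketch assumes that each sample has already been assigned a first-stage group (LG, MG or HG) and, where applicable, a second-stage group (NLG, NMG or NHG); the function name, the sample identifiers, and the data layout are hypothetical.

```python
def merge_er_positive_groups(first_stage, second_stage, min_high=10):
    """Map staged group labels to final low / intermediate / high risk.

    first_stage:  dict sample id -> "LG", "MG" or "HG" (from NRC-1, -2, -3).
    second_stage: dict sample id -> "NLG", "NMG" or "NHG" for HG samples
                  (from NRC-4, -5, -6); unused if HG has fewer than min_high samples.
    """
    n_high = sum(1 for g in first_stage.values() if g == "HG")
    final = {}
    for sample, group in first_stage.items():
        if group == "LG":
            final[sample] = "low"
        elif group == "MG" or n_high < min_high:
            # MG is always intermediate; a small HG group is merged into it.
            final[sample] = "intermediate"
        elif second_stage[sample] == "NLG":
            final[sample] = "intermediate"  # NLG is merged with MG
        else:
            final[sample] = "high"          # NMG and NHG are merged
    return final

# Hypothetical dataset: ten HG samples plus one LG and one MG sample.
first = {f"s{i}": "HG" for i in range(10)}
first.update({"a": "LG", "b": "MG"})
second = {f"s{i}": ("NLG" if i < 3 else "NMG" if i < 6 else "NHG") for i in range(10)}
final_groups = merge_er_positive_groups(first, second)
```

The threshold of 10 samples is taken from the text; it is exposed as the parameter `min_high` only for convenience.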
For ER− samples, in each dataset, we used NRC-7, -8 and -9 marker sets to stratify the samples into low (LG-) and high (HG-)-risk groups. We also obtained very good results with high predictive accuracy (~92-100% for non-recurrence patients) for the low-risk group and clearly separated the two groups in all 3 testing datasets (see Table 2).
For ER+ samples, when NRC-1, NRC-2 and NRC-3 are all in agreement in predicting the sample to be a “good” tumour, the accuracy was significantly better than when using a single marker set, such as NRC-1, NRC-2 or NRC-3 (Table 3). The same results were obtained for ER− samples when NRC-7, NRC-8 and NRC-9 are all in agreement in predicting the sample to be a “good” tumour (Table 3). In general, it is found that the integrative usage of 3 marker sets improves predictive accuracy over using a single set. In one embodiment of the invention accuracy was improved from about 70% to about 90%. In one embodiment of the invention, accuracy is at least 90%. In another embodiment it is at least 95%.
Thus, there are provided herein robust sets of biomarkers and uses thereof.
It will be understood that, depending on the type of cancer, and the condition of the patient, different gene profiles may be considered “bad”. Metastasis is generally considered to be a significant factor in the decision about how to treat a patient with cancer and biomarker sets, such as those disclosed herein, are useful for that purpose. In addition, biomarker sets can be used to identify cancer cell types which are likely to respond well (or poorly) to one or more particular drugs. Regardless of the exact factors being considered as “good” or “bad”, it will usually be desirable to begin the process with training sets S1 and S2 containing both “good” and “bad” genes. Level of gene expression may be considered when identifying good drug targets since highly-expressed targets frequently make good drug targets.
In general, the low-risk group (having a “good prognostic signature”) will not receive additional treatment, whereas the high-risk group (having a “poor prognostic signature”) should receive treatment in addition to surgery. Generally, the intermediate-risk group will receive additional treatment as well; however, this will depend on the typical standard of care for that type of tumour.
While each of the biomarker sets disclosed herein is, individually, useful in predicting the need for additional treatment, overall prediction accuracy can be markedly improved by the use of multiple biomarker sets.
For example, if a patient sample is screened against NRC-1, NRC-2 and NRC-3 and all three sets indicate “good” prognosis, the patient is considered to be low risk. If all indicate “bad” prognosis, the sample is considered to be high risk. If one or two sets say “bad” and the other(s) says “good”, the cancer is considered to be intermediate risk.
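By way of illustration only, the three-set rule described above might be sketched as follows. The function name and the dictionary layout of the per-set calls are hypothetical.

```python
def risk_group(calls):
    """Combine per-marker-set calls into a single risk group.

    calls: dict mapping marker set name -> "good" or "bad".
    All "good" gives low risk, all "bad" gives high risk, and any
    disagreement gives intermediate risk.
    """
    n_bad = sum(1 for call in calls.values() if call == "bad")
    if n_bad == 0:
        return "low"
    if n_bad == len(calls):
        return "high"
    return "intermediate"

# Hypothetical calls for one patient sample.
example = risk_group({"NRC-1": "good", "NRC-2": "bad", "NRC-3": "good"})
```

Requiring unanimity for the low-risk call is what allows the combined result to be more conservative than any single marker set.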
In an embodiment of the invention, in order to determine if a patient sample is “good” or “bad” in relation to any one biomarker set (e.g. NRC-1), the biomarker set is used to independently screen two banks of cancer cells representing samples from a large number of patients. The first bank represents “good” cancer cells (with a known clinical history of not exhibiting the behaviour or characteristic of concern, such as metastasis) and the second bank represents “bad” cancer cells (with a known clinical history of exhibiting the behaviour or characteristic of concern). Each of the “good” and “bad” banks will produce a gene expression signature (standard “good” and “bad” gene expression signatures for “good” and “bad” tumours), respectively, for each biomarker set. For a patient sample, the gene expression signature of a biomarker set of the patient sample is compared to the standard “good” and “bad” gene expression signatures of that biomarker set. Those patient samples which most closely resemble the standard “bad” signature of that biomarker set are considered “bad” and those which most closely resemble the standard “good” signature of that biomarker set are considered “good.”
The method may in some cases involve the combinatorial use of one or more of the following cancer biomarker sets: NRC-1, NRC-2, NRC-3, NRC-4, NRC-5, NRC-6, NRC-7, NRC-8, NRC-9.
Example of one possible approach to using the process when a subtype has been identified (for this example ER+/ER−):
In an embodiment of the invention there is provided a method of assessing the likelihood of a patient benefiting from additional cancer treatment in addition to surgery, said method comprising:
Detailed information on making microarray gene chips, and on scanning and normalization of array data, can be found at the Agilent company website:
http://www.chem.agilent.com/en-US/products/instruments/dnamicroarrays/pages/default.aspx
and in the publicly available literature.
indicates data missing or illegible when filed
The format of sequences is a FASTA format. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column.
An example sequence in FASTA:
In the description line, the first item, 6019, is the NCBI EntrezGene ID, which is the ID in the first column of Table 1; the item after the symbol (“|”) is the NCBI reference messenger RNA sequence ID. It should be noted that one EntrezGene ID may have several reference messenger RNA sequences. In this case, all the messenger RNA sequences for one EntrezGene ID are listed. Each sequence represents one reference messenger RNA sequence.
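By way of illustration only, sequences in the format described above might be read as sketched below. The sketch assumes a description line of the form `>EntrezGeneID|ReferenceSequenceID`; the reference sequence identifiers and nucleotide strings in the example are made-up placeholders and are not taken from the sequence listing, and the function name is hypothetical.

```python
def parse_fasta(text):
    """Return a list of (entrez_gene_id, reference_id, sequence) tuples.

    A line starting with ">" opens a new record; following lines are
    sequence data and are joined into one string.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((*header, "".join(chunks)))
            # The first item is the EntrezGene ID; the rest follows the "|".
            gene_id, _, ref_id = line[1:].partition("|")
            header, chunks = (gene_id.strip(), ref_id.strip()), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((*header, "".join(chunks)))
    return records

# Placeholder records: one EntrezGene ID with two reference sequences.
example_text = """>6019|REFSEQ_ID_A
ACGT
TTGA
>6019|REFSEQ_ID_B
GGCC
"""
records = parse_fasta(example_text)
```

Because one EntrezGene ID may have several reference sequences, the sketch returns a list of records rather than a dictionary keyed by gene ID.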
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CA10/00565 | 4/16/2010 | WO | 00 | 10/7/2011 |
Number | Date | Country |
---|---|---|
61202881 | Apr 2009 | US |