The likelihood of long-term mortality for patients with lung cancer is poorly defined by clinical stage and histopathological findings. Our hypothesis was that a multigene quantitative polymerase chain reaction (PCR) assay can predict risk of mortality among patients with lung cancer.
In one aspect, the present invention provides a method of providing a prognosis for lung cancer in a subject, the method comprising the steps of: (a) contacting a biological sample from the subject with reagents that specifically bind to a panel of biomarkers comprising ctnnb1, wnt3a, tp53, kras, erbb3, muc1, erbb2, and dusp6, and (b) determining whether or not the marker is differentially expressed in the sample by comparing the sample to a control non-cancerous cell sample; thereby providing a prognosis for lung cancer.
In one embodiment, the reagent is a nucleic acid. In another embodiment, the reagent is an oligonucleotide. In another embodiment, the reagent is a PCR primer set. In another embodiment, the reagent is an antibody.
In one embodiment, the sample is from a surgically resected tumor. In another embodiment, the sample is from a lung tissue or lung tumor biopsy.
In one aspect, the present invention provides a kit comprising reagents that specifically bind to a panel of biomarkers comprising ctnnb1, wnt3a, tp53, kras, erbb3, muc1, erbb2, and dusp6. In one embodiment, the reagent is a PCR primer set.
In another aspect, the invention features a method of determining the prognosis of a subject having a lung cancer by quantifying in a biological sample the expression levels of at least two (e.g., at least three or at least four) genes selected from a group of: Rnd3, wnt3a, erbb3, lck, sh3bgr, fut3, il11, cdc6, cdk2ap1, bag1, emx2, six3, and brca1. In this aspect, the biological sample (e.g., a tumor biopsy, a lung biopsy, and a blood sample) is derived from the subject and the expression levels are indicative of the prognosis.
In any of the foregoing aspects, the expression levels can be mRNA expression levels or protein levels, or a combination of both. The invention features determining mRNA expression levels through, for example, quantitative RT-PCR. The invention features determining protein levels through, for example, an antibody binding assay (e.g., an ELISA assay).
In yet another aspect, the invention features a method of determining the prognosis of a subject having a lung cancer by measuring in a biological sample the methylation levels of at least two (e.g., at least three or at least four) genes selected from a group consisting of: Rnd3, wnt3a, erbb3, lck, sh3bgr, fut3, il11, cdc6, cdk2ap1, bag1, emx2, six3, and brca1. In this aspect, the biological sample (e.g., a tumor biopsy, a lung biopsy, and a blood sample) is derived from the subject and the methylation levels are indicative of the prognosis.
In any of the foregoing aspects, the lung cancer can be lung adenocarcinoma, e.g., stage I, stage II, stage III, or stage IV.
Also, in any of the foregoing aspects, the prognosis provides a high or low risk assessment for long term mortality.
The invention features the identification of expression profiles of certain groups of genes which allows accurate prognosis of long term mortality in early stage lung cancer. In one embodiment, we identified 65 genes that were previously identified as prognostic for long-term mortality in early stage lung cancer in 3 published microarray studies and 2 PCR-based studies.
RNA was extracted from 124 fresh-frozen tumor samples from consecutive patients with completely resected lung adenocarcinoma with at least 3 years of clinical follow-up. 80 samples were randomly assigned to a test group and the remainder assigned to a validation group. Real-time PCR of the 65 identified genes was run on the test set using TaqMan assays. A prediction model was created using a proportional hazards model of normalized gene expression levels using backwards model selection. A model score was calculated for each patient using model coefficients and individual gene expression levels. Patients were defined as high-risk if the model score was greater than the median score.
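The scoring and median-split steps described above can be sketched as follows. This is a minimal illustration only: it assumes the model score is the usual proportional hazards linear predictor (the sum of each coefficient multiplied by the normalized expression level of its gene), and the coefficients, patient identifiers, and expression values shown are hypothetical and are not taken from the study.

```python
from statistics import median

def model_score(coefficients, expression):
    """Linear predictor: sum of coefficient * normalized expression over the model genes."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)

def classify_by_median(scores):
    """Label a patient 'high' if the score exceeds the cohort median, otherwise 'low'."""
    cutoff = median(scores.values())
    return {pid: ("high" if score > cutoff else "low") for pid, score in scores.items()}

# Hypothetical coefficients and normalized expression levels (illustration only).
coefficients = {"wnt3a": 0.5, "erbb3": -0.25, "lck": -0.5}
cohort = {
    "P1": {"wnt3a": 2.0, "erbb3": 1.0, "lck": 0.0},
    "P2": {"wnt3a": 0.0, "erbb3": 2.0, "lck": 2.0},
    "P3": {"wnt3a": 1.0, "erbb3": 0.0, "lck": 1.0},
    "P4": {"wnt3a": 3.0, "erbb3": 0.0, "lck": 1.0},
}

scores = {pid: model_score(coefficients, expr) for pid, expr in cohort.items()}
risk = classify_by_median(scores)
```

In practice the coefficients would come from the fitted proportional hazards model and the cutoff from the test-set scores; fitting the model itself is outside the scope of this sketch.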
Adequate real-time PCR profiles were identified in all 80 patients. Eighteen genes were included in the final model. The proportion of patients identified as high-risk and low-risk was 52% and 48%, respectively. The Kaplan-Meier estimated five-year survival was 82% in the low-risk group and 5% in the high-risk group (P<0.001, log-rank test). Median survival was 22 months in the high-risk group and was not reached in the low-risk group. In multivariate survival analysis, the prognostic score predicted survival independent of tumor stage and size (P<0.001). Prognostic score predicted mortality better than clinical stage, based on model log-likelihood values (P<0.001). For clinical staging, see, e.g., Mountain, Clifton F; Herman I Libshitz, Kay E Hermes. A Handbook for Staging, Imaging, and Lymph Node Classification. Charles P Young Company, and Mountain, C F (1997). “Revisions in the international system for staging lung cancer”. Chest 111: 1710-1717.
In another embodiment, the invention features the quantification of expression of the following genes as a prognostic indicator of mortality due to lung cancers: rnd3, wnt3a, erbb3, lck, sh3bgr, fut3, il11, cdc6, cdk2ap1, bag1, emx2, six3, and brca1 (e.g., wnt3a, rnd3, lck, and erbb3) (see Example 2 below).
The eight-membered multi-gene RT-PCR assay (ctnnb1, wnt3a, tp53, kras, erbb3, muc1, erbb2, and dusp6), the 13-membered multi-gene RT-PCR assay (rnd3, wnt3a, erbb3, lck, sh3bgr, fut3, il11, cdc6, cdk2ap1, bag1, emx2, six3, and brca1), or any multi-gene RT-PCR assay that includes wnt3a, rnd3, lck, and erbb3, aids in predicting long-term mortality among patients with lung cancer.
The invention also comprises a multigene diagnostic kit, composed of the markers described herein that can be used to provide a prognosis for lung cancer patients.
“Lung cancer” refers generally to two main types of lung cancer categorized by the size and appearance of the malignant cells: non-small cell (80%) and small-cell (roughly 20%) lung cancer. “Non-small cell lung cancer”(NSCLC) includes squamous cell carcinoma, accounting for approximately 29% of lung cancers. Lung adenocarcinoma is the most common subtype of NSCLC, accounting for approximately 32% of lung cancers. A subtype of lung adenocarcinoma, the bronchioloalveolar carcinoma, is more common in female never-smokers. Large cell carcinoma accounts for approximately 9% of lung cancers. “Small cell lung cancer” includes SCLC (also called “oat cell carcinoma”), a less common form of lung cancer. Other types of lung cancer include carcinoid, adenoid cystic carcinoma, cylindroma, and mucoepidermoid carcinoma. In one embodiment, lung cancers are staged according to stages I-IV, with I being an early stage and IV being the most advanced.
“Prognosis” refers, e.g., to overall survival, long term mortality, and disease free survival. In one embodiment, long term mortality refers to survival 5 years after diagnosis of lung cancer. In one embodiment, the prognosis for long term mortality is “high risk,” e.g., high risk of mortality, or “low risk,” e.g., low risk of mortality. The stage of cancer and the prognosis may be used to tailor a patient's therapy to provide a better outcome, e.g., targeted therapy and surgery, surgery alone, or targeted therapy alone.
Other forms of cancer include carcinomas, sarcomas, adenocarcinomas, lymphomas, leukemias, etc., including solid and lymphoid cancers, head and neck cancer, e.g., oral cavity, pharyngeal and tongue cancer, kidney, breast, bladder, colon, ovarian, prostate, pancreas, stomach, brain, skin, uterine, testicular, esophagus, and liver cancer, including hepatocarcinoma, lymphoma, including non-Hodgkin's lymphomas (e.g., Burkitt's, Small Cell, and Large Cell lymphomas) and Hodgkin's lymphoma, leukemia, and multiple myeloma.
The term “marker” refers to a molecule (typically protein, nucleic acid, carbohydrate, or lipid) that is expressed in the cell, expressed on the surface of a cancer cell or secreted by a cancer cell in comparison to a non-cancer cell, and which is useful for the diagnosis of cancer, for providing a prognosis, and for preferential targeting of a pharmacological agent to the cancer cell. Oftentimes, such markers are molecules that are overexpressed in a lung cancer or other cancer cell in comparison to a non-cancer cell, for instance, 1-fold overexpression, 2-fold overexpression, 3-fold overexpression or more in comparison to a normal cell. Alternatively, such biomarkers are molecules that are underexpressed in a cancer cell in comparison to a non-cancer cell, for instance, 1-fold underexpression, 2-fold underexpression, 3-fold underexpression, or more. Further, a marker can be a molecule that is inappropriately synthesized in the cancer cell, for instance, a molecule that contains deletions, additions or mutations in comparison to the molecule expressed on a normal cell.
It will be understood by the skilled artisan that markers may be used in combination with other markers or tests for any of the uses, e.g., prediction, diagnosis, or prognosis of cancer, disclosed herein.
“Biological sample” includes sections of tissues such as biopsy and autopsy samples, and frozen sections taken for histologic purposes. Such samples include blood and blood fractions or products (e.g., serum, platelets, red blood cells, and the like), sputum, bronchoalveolar lavage, cultured cells, e.g., primary cultures, explants, and transformed cells, stool, urine, etc. A biological sample is typically obtained from a eukaryotic organism, most preferably a mammal such as a primate, e.g., chimpanzee or human; cow; dog; cat; a rodent, e.g., guinea pig, rat, mouse; rabbit; or a bird; reptile; or fish.
A “biopsy” refers to the process of removing a tissue sample for diagnostic or prognostic evaluation, and to the tissue specimen itself. Any biopsy technique known in the art can be applied to the diagnostic and prognostic methods of the present invention. The biopsy technique applied will depend on the tissue type to be evaluated (e.g., lung etc.), the size and type of the tumor, among other factors. Representative biopsy techniques include, but are not limited to, excisional biopsy, incisional biopsy, needle biopsy, surgical biopsy, and bone marrow biopsy. An “excisional biopsy” refers to the removal of an entire tumor mass with a small margin of normal tissue surrounding it. An “incisional biopsy” refers to the removal of a wedge of tissue that includes a cross-sectional diameter of the tumor. A diagnosis or prognosis made by endoscopy or fluoroscopy can require a “core-needle biopsy”, or a “fine-needle aspiration biopsy” which generally obtains a suspension of cells from within a target tissue. Biopsy techniques are discussed, for example, in Harrison's Principles of Internal Medicine, Kasper, et al., eds., 16th ed., 2005, Chapter 70, and throughout Part V.
The terms “overexpress,” “overexpression” or “overexpressed” interchangeably refer to a protein or nucleic acid (RNA) that is transcribed or translated at a detectably greater level, usually in a cancer cell, in comparison to a normal cell. The term includes overexpression due to transcription, post transcriptional processing, translation, post-translational processing, cellular localization (e.g., organelle, cytoplasm, nucleus, cell surface), and RNA and protein stability, as compared to a normal cell. Overexpression can be detected using conventional techniques for detecting mRNA (i.e., RT-PCR, PCR, hybridization) or proteins (i.e., ELISA, immunohistochemical techniques). Overexpression can be 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or more in comparison to a normal cell. In certain instances, overexpression is 1-fold, 2-fold, 3-fold, 4-fold or more higher levels of transcription or translation in comparison to a normal cell.
The terms “underexpress,” “underexpression” or “underexpressed” or “downregulated” interchangeably refer to a protein or nucleic acid that is transcribed or translated at a detectably lower level in a cancer cell, in comparison to a normal cell. The term includes underexpression due to transcription, post transcriptional processing, translation, post-translational processing, cellular localization (e.g., organelle, cytoplasm, nucleus, cell surface), and RNA and protein stability, as compared to a control. Underexpression can be detected using conventional techniques for detecting mRNA (i.e., RT-PCR, PCR, hybridization) or proteins (i.e., ELISA, immunohistochemical techniques). Underexpression can be 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or less in comparison to a control. In certain instances, underexpression is 1-fold, 2-fold, 3-fold, 4-fold or more lower levels of transcription or translation in comparison to a control.
The term “differentially expressed” or “differentially regulated” refers generally to a protein or nucleic acid that is overexpressed (upregulated) or underexpressed (downregulated) in one sample compared to at least one other sample, generally in a cancer patient, in comparison to a patient without cancer, in the context of the present invention.
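By way of illustration only, the comparison of a sample to a control can be sketched as a simple ratio test. The function name, the default 2-fold threshold, and the reading of "N-fold lower" as a ratio of at most 1/N are assumptions made for this sketch; the definitions above contemplate a range of percentage and fold thresholds.

```python
def classify_expression(sample_level, control_level, min_fold=2.0):
    """Compare a sample's expression level to a control level.

    Returns 'overexpressed' when the sample is at least min_fold times the
    control, 'underexpressed' when it is at most 1/min_fold of the control,
    and 'unchanged' otherwise. The threshold is an illustrative assumption.
    """
    if sample_level < 0 or control_level <= 0:
        raise ValueError("levels must be non-negative, with a positive control level")
    if min_fold <= 1:
        raise ValueError("min_fold must be greater than 1")
    ratio = sample_level / control_level
    if ratio >= min_fold:
        return "overexpressed"
    if ratio <= 1.0 / min_fold:
        return "underexpressed"
    return "unchanged"
```

A marker classified as either overexpressed or underexpressed under such a test would count as differentially expressed in the sense defined above.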
“Therapeutic treatment” and “cancer therapies” refers to chemotherapy, hormonal therapy, radiotherapy, immunotherapy, and biologic (targeted) therapy.
By “therapeutically effective amount or dose” or “sufficient amount or dose” herein is meant a dose that produces effects for which it is administered. The exact dose will depend on the purpose of the treatment, and will be ascertainable by one skilled in the art using known techniques (see, e.g., Lieberman, Pharmaceutical Dosage Forms (vols. 1-3, 1992); Lloyd, The Art, Science and Technology of Pharmaceutical Compounding (1999); Pickar, Dosage Calculations (1999); and Remington: The Science and Practice of Pharmacy, 20th Edition, 2003, Gennaro, Ed., Lippincott, Williams & Wilkins).
The terms “identical” or percent “identity,” in the context of two or more nucleic acids or polypeptide sequences, refer to two or more sequences or subsequences that are the same or have a specified percentage of amino acid residues or nucleotides that are the same (i.e., about 60% identity, preferably 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or higher identity over a specified region, when compared and aligned for maximum correspondence over a comparison window or designated region) as measured using a BLAST or BLAST 2.0 sequence comparison algorithm with default parameters described below, or by manual alignment and visual inspection (see, e.g., NCBI web site ncbi.nlm.nih.gov/BLAST or the like). Such sequences are then said to be “substantially identical.” This definition also refers to, or may be applied to, the complement of a test sequence. The definition also includes sequences that have deletions and/or additions, as well as those that have substitutions. As described below, the preferred algorithms can account for gaps and the like. Preferably, identity exists over a region that is at least about 25 amino acids or nucleotides in length, or more preferably over a region that is 50-100 amino acids or nucleotides in length. The biomarkers described herein can be detected with probes that have, e.g., more than 70% identity over a specified region, or more than 80% identity, or more than 90% identity to the reference sequence provided by the accession number, up to 100% identity.
For sequence comparison, typically one sequence acts as a reference sequence, to which test sequences are compared. When using a sequence comparison algorithm, test and reference sequences are entered into a computer, subsequence coordinates are designated, if necessary, and sequence algorithm program parameters are designated. Preferably, default program parameters can be used, or alternative parameters can be designated. The sequence comparison algorithm then calculates the percent sequence identities for the test sequences relative to the reference sequence, based on the program parameters.
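As a minimal sketch of the final step, percent identity over an already aligned region can be computed as below. The sketch assumes the two sequences have been aligned beforehand, uses '-' as the gap character, and divides by the full alignment length; these conventions are assumptions of the example, and alignment programs may report identity over a different denominator.

```python
def percent_identity(aligned_test, aligned_ref):
    """Percentage of alignment columns in which both sequences carry the same residue.

    Columns containing a gap character ('-') are counted as non-identical.
    """
    if len(aligned_test) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_test:
        raise ValueError("aligned sequences must not be empty")
    matches = sum(
        1 for a, b in zip(aligned_test, aligned_ref) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_test)
```

For example, the aligned pair "ACGT-ACGTA" and "ACGTTACGAA" has eight identical columns out of ten, giving 80% identity under these conventions.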
A “comparison window,” as used herein, includes reference to a segment of any one of the number of contiguous positions selected from the group consisting of from 20 to 600, usually about 50 to about 200, more usually about 100 to about 150 in which a sequence may be compared to a reference sequence of the same number of contiguous positions after the two sequences are optimally aligned. Methods of alignment of sequences for comparison are well-known in the art. Optimal alignment of sequences for comparison can be conducted, e.g., by the local homology algorithm of Smith & Waterman, Adv. Appl. Math. 2:482 (1981), by the homology alignment algorithm of Needleman & Wunsch, J. Mol. Biol. 48:443 (1970), by the search for similarity method of Pearson & Lipman, Proc. Nat'l. Acad. Sci. USA 85:2444 (1988), by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Genetics Computer Group, 575 Science Dr., Madison, Wis.), or by manual alignment and visual inspection (see, e.g., Current Protocols in Molecular Biology (Ausubel et al., eds. 1987-2005, Wiley Interscience)).
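For orientation, the local alignment approach of Smith & Waterman can be sketched as a dynamic programming recurrence. The sketch below computes only the best local alignment score, uses a simple linear gap penalty, and adopts arbitrary illustrative scoring values; it is not the implementation found in any of the software packages named above.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (linear gap penalty).

    Each cell holds the best score of any local alignment ending at that pair
    of positions; scores are floored at zero so an alignment can start anywhere.
    """
    cols = len(b) + 1
    prev = [0] * cols          # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols      # current row; column 0 stays at zero
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

With these illustrative scores, two sequences sharing the four-residue stretch ACGT and nothing else score 8, and sequences with no residue in common score 0.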
Preferred examples of algorithms that are suitable for determining percent sequence identity and sequence similarity are the BLAST and BLAST 2.0 algorithms, which are described in Altschul et al., Nuc. Acids Res. 25:3389-3402 (1997) and Altschul et al., J. Mol. Biol. 215:403-410 (1990), respectively. BLAST and BLAST 2.0 are used, with the parameters described herein, to determine percent sequence identity for the nucleic acids and proteins of the invention. Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/). This algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence, which either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold (Altschul et al., supra). These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Cumulative scores are calculated using, for nucleotide sequences, the parameters M (reward score for a pair of matching residues; always>0) and N (penalty score for mismatching residues; always<0). For amino acid sequences, a scoring matrix is used to calculate the cumulative score. Extension of the word hits in each direction is halted when: the cumulative alignment score falls off by the quantity X from its maximum achieved value; the cumulative score goes to zero or below, due to the accumulation of one or more negative-scoring residue alignments; or the end of either sequence is reached. The BLAST algorithm parameters W, T, and X determine the sensitivity and speed of the alignment.
The BLASTN program (for nucleotide sequences) uses as defaults a wordlength (W) of 11, an expectation (E) of 10, M=5, N=−4 and a comparison of both strands. For amino acid sequences, the BLASTP program uses as defaults a wordlength of 3, an expectation (E) of 10, and the BLOSUM62 scoring matrix (see Henikoff & Henikoff, Proc. Natl. Acad. Sci. USA 89:10915 (1992)) alignments (B) of 50, expectation (E) of 10, M=5, N=−4, and a comparison of both strands.
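The three halting conditions for word-hit extension can be illustrated with a simplified, ungapped extension in one direction. This is a sketch under stated assumptions, not the BLAST implementation: the function name and the default X value of 20 are hypothetical, while the reward and penalty defaults follow the M=5 and N=−4 values given above for nucleotide sequences.

```python
def extend_right(query, subject, q_start, s_start, reward=5, penalty=-4, x_drop=20):
    """Ungapped rightward extension from a seed position.

    Returns (best_score, best_length): the maximum cumulative score reached
    and the number of aligned positions at which it was reached.
    """
    score = 0
    best_score = 0
    best_length = 0
    i = 0
    # Halting condition 3: stop when the end of either sequence is reached.
    while q_start + i < len(query) and s_start + i < len(subject):
        score += reward if query[q_start + i] == subject[s_start + i] else penalty
        i += 1
        if score > best_score:
            best_score, best_length = score, i
        # Halting condition 2: cumulative score has gone to zero or below.
        # Halting condition 1: score has fallen off by X from its maximum.
        if score <= 0 or best_score - score >= x_drop:
            break
    return best_score, best_length
```

A full implementation would extend in both directions from the word hit and, for amino acid sequences, would take per-pair scores from a scoring matrix rather than a fixed reward and penalty.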
“Nucleic acid” refers to deoxyribonucleotides or ribonucleotides and polymers thereof in either single- or double-stranded form, and complements thereof. The term encompasses nucleic acids containing known nucleotide analogs or modified backbone residues or linkages, which are synthetic, naturally occurring, and non-naturally occurring, which have similar binding properties as the reference nucleic acid, and which are metabolized in a manner similar to the reference nucleotides. Examples of such analogs include, without limitation, phosphorothioates, phosphoramidates, methyl phosphonates, chiral-methyl phosphonates, 2-O-methyl ribonucleotides, and peptide-nucleic acids (PNAs).
Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions) and complementary sequences, as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer et al., Nucleic Acid Res. 19:5081 (1991); Ohtsuka et al., J. Biol. Chem. 260:2605-2608 (1985); Rossolini et al., Mol. Cell. Probes 8:91-98 (1994)). The term nucleic acid is used interchangeably with gene, cDNA, mRNA, oligonucleotide, and polynucleotide.
A particular nucleic acid sequence also implicitly encompasses “splice variants” and nucleic acid sequences encoding truncated forms of a protein. Similarly, a particular protein encoded by a nucleic acid implicitly encompasses any protein encoded by a splice variant or truncated form of that nucleic acid. “Splice variants,” as the name suggests, are products of alternative splicing of a gene. After transcription, an initial nucleic acid transcript may be spliced such that different (alternate) nucleic acid splice products encode different polypeptides. Mechanisms for the production of splice variants vary, but include alternate splicing of exons. Alternate polypeptides derived from the same nucleic acid by read-through transcription are also encompassed by this definition. Any products of a splicing reaction, including recombinant forms of the splice products, are included in this definition. Nucleic acids can be truncated at the 5′ end or at the 3′ end. Polypeptides can be truncated at the N-terminal end or the C-terminal end. Truncated versions of nucleic acid or polypeptide sequences can be naturally occurring or recombinantly created.
The terms “polypeptide,” “peptide” and “protein” are used interchangeably herein to refer to a polymer of amino acid residues. The terms apply to amino acid polymers in which one or more amino acid residue is an artificial chemical mimetic of a corresponding naturally occurring amino acid, as well as to naturally occurring amino acid polymers and non-naturally occurring amino acid polymers.
The term “amino acid” refers to naturally occurring and synthetic amino acids, as well as amino acid analogs and amino acid mimetics that function in a manner similar to the naturally occurring amino acids. Naturally occurring amino acids are those encoded by the genetic code, as well as those amino acids that are later modified, e.g., hydroxyproline, γ-carboxyglutamate, and O-phosphoserine. Amino acid analogs refers to compounds that have the same basic chemical structure as a naturally occurring amino acid, i.e., an α carbon that is bound to a hydrogen, a carboxyl group, an amino group, and an R group, e.g., homoserine, norleucine, methionine sulfoxide, methionine methyl sulfonium. Such analogs have modified R groups (e.g., norleucine) or modified peptide backbones, but retain the same basic chemical structure as a naturally occurring amino acid. Amino acid mimetics refers to chemical compounds that have a structure that is different from the general chemical structure of an amino acid, but that functions in a manner similar to a naturally occurring amino acid.
Amino acids may be referred to herein by either their commonly known three letter symbols or by the one-letter symbols recommended by the IUPAC-IUB Biochemical Nomenclature Commission. Nucleotides, likewise, may be referred to by their commonly accepted single-letter codes.
“Conservatively modified variants” applies to both amino acid and nucleic acid sequences. With respect to particular nucleic acid sequences, conservatively modified variants refers to those nucleic acids which encode identical or essentially identical amino acid sequences, or where the nucleic acid does not encode an amino acid sequence, to essentially identical sequences. Because of the degeneracy of the genetic code, a large number of functionally identical nucleic acids encode any given protein. For instance, the codons GCA, GCC, GCG and GCU all encode the amino acid alanine. Thus, at every position where an alanine is specified by a codon, the codon can be altered to any of the corresponding codons described without altering the encoded polypeptide. Such nucleic acid variations are “silent variations,” which are one species of conservatively modified variations. Every nucleic acid sequence herein which encodes a polypeptide also describes every possible silent variation of the nucleic acid. One of skill will recognize that each codon in a nucleic acid (except AUG, which is ordinarily the only codon for methionine, and TGG, which is ordinarily the only codon for tryptophan) can be modified to yield a functionally identical molecule. Accordingly, each silent variation of a nucleic acid which encodes a polypeptide is implicit in each described sequence with respect to the expression product, but not with respect to actual probe sequences.
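The notion of silent variation can be illustrated by enumerating every coding sequence that yields the same polypeptide. The sketch below deliberately uses only a three-entry excerpt of the standard genetic code (alanine, methionine, tryptophan), written in RNA form; the table name and functions are hypothetical, and a complete table would be needed for real sequences.

```python
from itertools import product

# Partial excerpt of the standard genetic code (RNA codons), for illustration only.
CODONS_BY_AMINO_ACID = {
    "Ala": ["GCA", "GCC", "GCG", "GCU"],
    "Met": ["AUG"],
    "Trp": ["UGG"],
}
AMINO_ACID_BY_CODON = {
    codon: aa for aa, codons in CODONS_BY_AMINO_ACID.items() for codon in codons
}

def translate(rna):
    """Translate an RNA coding sequence into a list of amino acid names."""
    if len(rna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return [AMINO_ACID_BY_CODON[rna[i:i + 3]] for i in range(0, len(rna), 3)]

def silent_variants(rna):
    """All coding sequences (including the input) that encode the same polypeptide."""
    choices = [CODONS_BY_AMINO_ACID[aa] for aa in translate(rna)]
    return ["".join(combo) for combo in product(*choices)]
```

A sequence encoding Met-Ala-Ala has 1 × 4 × 4 = 16 silent variants under this table, whereas a sequence encoding only Met-Trp has no variant other than itself.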
As to amino acid sequences, one of skill will recognize that individual substitutions, deletions or additions to a nucleic acid, peptide, polypeptide, or protein sequence which alters, adds or deletes a single amino acid or a small percentage of amino acids in the encoded sequence is a “conservatively modified variant” where the alteration results in the substitution of an amino acid with a chemically similar amino acid. Conservative substitution tables providing functionally similar amino acids are well known in the art. Such conservatively modified variants are in addition to and do not exclude polymorphic variants, interspecies homologs, and alleles of the invention.
The following eight groups each contain amino acids that are conservative substitutions for one another: 1) Alanine (A), Glycine (G); 2) Aspartic acid (D), Glutamic acid (E); 3) Asparagine (N), Glutamine (Q); 4) Arginine (R), Lysine (K); 5) Isoleucine (I), Leucine (L), Methionine (M), Valine (V); 6) Phenylalanine (F), Tyrosine (Y), Tryptophan (W); 7) Serine (S), Threonine (T); and 8) Cysteine (C), Methionine (M). See, e.g., Creighton, Proteins (1984).
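The eight groups listed above lend themselves to a simple lookup. The following sketch encodes them with the standard one-letter amino acid codes; the function name is hypothetical, and treating an unchanged residue as "not a substitution" is a choice made for this example.

```python
# The eight conservative substitution groups, in standard one-letter codes.
CONSERVATIVE_GROUPS = [
    set("AG"),    # 1) Alanine, Glycine
    set("DE"),    # 2) Aspartic acid, Glutamic acid
    set("NQ"),    # 3) Asparagine, Glutamine
    set("RK"),    # 4) Arginine, Lysine
    set("ILMV"),  # 5) Isoleucine, Leucine, Methionine, Valine
    set("FYW"),   # 6) Phenylalanine, Tyrosine, Tryptophan
    set("ST"),    # 7) Serine, Threonine
    set("CM"),    # 8) Cysteine, Methionine
]

def is_conservative_substitution(original, replacement):
    """True if the two different residues share at least one of the eight groups."""
    a, b = original.upper(), replacement.upper()
    if a == b:
        return False  # identical residues are not a substitution
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)
```

Because methionine appears in both group 5 and group 8, it is conservative with respect to leucine and to cysteine, although leucine and cysteine are not conservative substitutions for one another.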
A “label” or a “detectable moiety” is a composition detectable by spectroscopic, photochemical, biochemical, immunochemical, chemical, or other physical means. For example, useful labels include 32P, fluorescent dyes, electron-dense reagents, enzymes (e.g., as commonly used in an ELISA), biotin, digoxigenin, or haptens and proteins which can be made detectable, e.g., by incorporating a radiolabel into the peptide or used to detect antibodies specifically reactive with the peptide.
The term “recombinant” when used with reference, e.g., to a cell, or nucleic acid, protein, or vector, indicates that the cell, nucleic acid, protein or vector, has been modified by the introduction of a heterologous nucleic acid or protein or the alteration of a native nucleic acid or protein, or that the cell is derived from a cell so modified. Thus, for example, recombinant cells express genes that are not found within the native (non-recombinant) form of the cell or express native genes that are otherwise abnormally expressed, under expressed or not expressed at all.
The phrase “stringent hybridization conditions” refers to conditions under which a probe will hybridize to its target subsequence, typically in a complex mixture of nucleic acids, but to no other sequences. Stringent conditions are sequence-dependent and will be different in different circumstances. Longer sequences hybridize specifically at higher temperatures. An extensive guide to the hybridization of nucleic acids is found in Tijssen, Techniques in Biochemistry and Molecular Biology-Hybridization with Nucleic Probes, “Overview of principles of hybridization and the strategy of nucleic acid assays” (1993). Generally, stringent conditions are selected to be about 5-10° C. lower than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and pH. The Tm is the temperature (under defined ionic strength, pH, and nucleic acid concentration) at which 50% of the probes complementary to the target hybridize to the target sequence at equilibrium (as the target sequences are present in excess, at Tm, 50% of the probes are occupied at equilibrium). Stringent conditions may also be achieved with the addition of destabilizing agents such as formamide. For selective or specific hybridization, a positive signal is at least two times background, preferably 10 times background hybridization. Exemplary stringent hybridization conditions can be as follows: 50% formamide, 5×SSC, and 1% SDS, incubating at 42° C., or, 5×SSC, 1% SDS, incubating at 65° C., with wash in 0.2×SSC, and 0.1% SDS at 65° C.
Nucleic acids that do not hybridize to each other under stringent conditions are still substantially identical if the polypeptides which they encode are substantially identical. This occurs, for example, when a copy of a nucleic acid is created using the maximum codon degeneracy permitted by the genetic code. In such cases, the nucleic acids typically hybridize under moderately stringent hybridization conditions. Exemplary “moderately stringent hybridization conditions” include a hybridization in a buffer of 40% formamide, 1 M NaCl, 1% SDS at 37° C., and a wash in 1×SSC at 45° C. A positive hybridization is at least twice background. Those of ordinary skill will readily recognize that alternative hybridization and wash conditions can be utilized to provide conditions of similar stringency. Additional guidelines for determining hybridization parameters are provided in numerous references, e.g., Current Protocols in Molecular Biology, ed. Ausubel, et al., supra.
For PCR, a temperature of about 36° C. is typical for low stringency amplification, although annealing temperatures may vary between about 32° C. and 48° C. depending on primer length. For high stringency PCR amplification, a temperature of about 62° C. is typical, although high stringency annealing temperatures can range from about 50° C. to about 65° C., depending on the primer length and specificity. Typical cycle conditions for both high and low stringency amplifications include a denaturation phase of 90° C.-95° C. for 30 sec.-2 min., an annealing phase lasting 30 sec.-2 min., and an extension phase of about 72° C. for 1-2 min. Protocols and guidelines for low and high stringency amplification reactions are provided, e.g., in Innis et al. (1990) PCR Protocols, A Guide to Methods and Applications, Academic Press, Inc., N.Y.
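Two of the numerical rules of thumb stated above, the temperature window of about 5-10° C. below Tm and the signal threshold of at least two times background, can be expressed as a short sketch. The function names are hypothetical, and the Tm is taken as a given input because no formula for estimating it is provided here.

```python
def stringent_temperature_range(tm_celsius):
    """Temperature window (low, high) lying about 10 to 5 degrees C below the Tm."""
    return (tm_celsius - 10.0, tm_celsius - 5.0)

def is_positive_signal(signal, background, fold=2.0):
    """True if the hybridization signal is at least `fold` times background.

    The default of 2.0 reflects the minimum stated above; fold=10.0
    corresponds to the preferred threshold.
    """
    if background <= 0:
        raise ValueError("background must be positive")
    return signal >= fold * background
```

For a probe with a Tm of 70° C., the window would be 60-65° C.; a signal of 250 units over a background of 100 units would count as positive at the two-fold threshold but not at the preferred ten-fold threshold.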
“Antibody” refers to a polypeptide comprising a framework region from an immunoglobulin gene or fragments thereof that specifically binds and recognizes an antigen. The recognized immunoglobulin genes include the kappa, lambda, alpha, gamma, delta, epsilon, and mu constant region genes, as well as the myriad immunoglobulin variable region genes. Light chains are classified as either kappa or lambda. Heavy chains are classified as gamma, mu, alpha, delta, or epsilon, which in turn define the immunoglobulin classes, IgG, IgM, IgA, IgD and IgE, respectively. Typically, the antigen-binding region of an antibody will be most critical in specificity and affinity of binding. Antibodies can be polyclonal or monoclonal, derived from serum, a hybridoma or recombinantly cloned, and can also be chimeric, primatized, or humanized.
An exemplary immunoglobulin (antibody) structural unit comprises a tetramer. Each tetramer is composed of two identical pairs of polypeptide chains, each pair having one “light” (about 25 kDa) and one “heavy” chain (about 50-70 kDa). The N-terminus of each chain defines a variable region of about 100 to 110 or more amino acids primarily responsible for antigen recognition. The terms variable light chain (VL) and variable heavy chain (VH) refer to these light and heavy chains respectively.
Antibodies exist, e.g., as intact immunoglobulins or as a number of well-characterized fragments produced by digestion with various peptidases. Thus, for example, pepsin digests an antibody below the disulfide linkages in the hinge region to produce F(ab)′2, a dimer of Fab which itself is a light chain joined to VH-CH1 by a disulfide bond. The F(ab)′2 may be reduced under mild conditions to break the disulfide linkage in the hinge region, thereby converting the F(ab)′2 dimer into an Fab′ monomer. The Fab′ monomer is essentially Fab with part of the hinge region (see Fundamental Immunology (Paul ed., 3d ed. 1993)). While various antibody fragments are defined in terms of the digestion of an intact antibody, one of skill will appreciate that such fragments may be synthesized de novo either chemically or by using recombinant DNA methodology. Thus, the term antibody, as used herein, also includes antibody fragments either produced by the modification of whole antibodies, or those synthesized de novo using recombinant DNA methodologies (e.g., single chain Fv) or those identified using phage display libraries (see, e.g., McCafferty et al., Nature 348:552-554 (1990)).
In one embodiment, the antibody is conjugated to an “effector” moiety. The effector moiety can be any number of molecules, including labeling moieties such as radioactive labels or fluorescent labels, or can be a therapeutic moiety. In one aspect the antibody modulates the activity of the protein.
The nucleic acids of the differentially expressed genes of this invention or their encoded polypeptides refer to all forms of nucleic acids (e.g., gene, pre-mRNA, mRNA) or proteins, their polymorphic variants, alleles, mutants, and interspecies homologs that (as applicable to nucleic acid or protein): (1) have an amino acid sequence that has greater than about 60% amino acid sequence identity, 65%, 70%, 75%, 80%, 85%, 90%, preferably 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% or greater amino acid sequence identity, preferably over a region of at least about 25, 50, 100, 200, 500, 1000, or more amino acids, to a polypeptide encoded by a referenced nucleic acid or an amino acid sequence described herein; (2) specifically bind to antibodies, e.g., polyclonal antibodies, raised against an immunogen comprising a referenced amino acid sequence, immunogenic fragments thereof, and conservatively modified variants thereof; (3) specifically hybridize under stringent hybridization conditions to a nucleic acid encoding a referenced amino acid sequence, and conservatively modified variants thereof; or (4) have a nucleic acid sequence that has greater than about 95%, preferably greater than about 96%, 97%, 98%, 99%, or higher nucleotide sequence identity, preferably over a region of at least about 25, 50, 100, 200, 500, 1000, or more nucleotides, to a reference nucleic acid sequence. A polynucleotide or polypeptide sequence is typically from a mammal including, but not limited to, primate, e.g., human; rodent, e.g., rat, mouse, hamster; cow, pig, horse, sheep, or any mammal. The nucleic acids and proteins of the invention include both naturally occurring and recombinant molecules. Truncated and alternatively spliced forms of these antigens are included in the definition.
The phrase “specifically (or selectively) binds” when referring to a protein, nucleic acid, antibody, or small molecule compound refers to a binding reaction that is determinative of the presence of the protein or nucleic acid, such as the differentially expressed genes of the present invention, often in a heterogeneous population of proteins or nucleic acids and other biologics. In the case of antibodies, under designated immunoassay conditions, a specified antibody may bind to a particular protein at least two times the background and more typically more than 10 to 100 times background. Specific binding to an antibody under such conditions requires an antibody that is selected for its specificity for a particular protein. For example, polyclonal antibodies can be selected to obtain only those polyclonal antibodies that are specifically immunoreactive with the selected antigen and not with other proteins. This selection may be achieved by subtracting out antibodies that cross-react with other molecules. A variety of immunoassay formats may be used to select antibodies specifically immunoreactive with a particular protein. For example, solid-phase ELISA immunoassays are routinely used to select antibodies specifically immunoreactive with a protein (see, e.g., Harlow & Lane, Antibodies, A Laboratory Manual (1988) for a description of immunoassay formats and conditions that can be used to determine specific immunoreactivity).
The phrase “functional effects” in the context of assays for testing compounds that modulate a marker protein includes the determination of a parameter that is indirectly or directly under the influence of a biomarker of the invention, e.g., a chemical or phenotypic effect. A functional effect therefore includes ligand binding activity, transcriptional activation or repression, the ability of cells to proliferate, and the ability to migrate, among others. “Functional effects” include in vitro, in vivo, and ex vivo activities.
By “determining the functional effect” is meant assaying for a compound that increases or decreases a parameter that is indirectly or directly under the influence of a biomarker of the invention, e.g., measuring physical and chemical or phenotypic effects. Such functional effects can be measured by any means known to those skilled in the art, e.g., changes in spectroscopic characteristics (e.g., fluorescence, absorbance, refractive index); hydrodynamic (e.g., shape), chromatographic; or solubility properties for the protein; ligand binding assays, e.g., binding to antibodies; measuring inducible markers or transcriptional activation of the marker; measuring changes in enzymatic activity; the ability to increase or decrease cellular proliferation, apoptosis, cell cycle arrest, measuring changes in cell surface markers. The functional effects can be evaluated by many means known to those skilled in the art, e.g., microscopy for quantitative or qualitative measures of alterations in morphological features, measurement of changes in RNA or protein levels for other genes expressed in placental tissue, measurement of RNA stability, identification of downstream or reporter gene expression (CAT, luciferase, β-gal, GFP and the like), e.g., via chemiluminescence, fluorescence, colorimetric reactions, antibody binding, inducible markers, etc.
“Inhibitors,” “activators,” and “modulators” of the markers are used to refer to activating, inhibitory, or modulating molecules identified using in vitro and in vivo assays of cancer biomarkers. Inhibitors are compounds that, e.g., bind to, partially or totally block activity, decrease, prevent, delay activation, inactivate, desensitize, or down regulate the activity or expression of cancer biomarkers. “Activators” are compounds that increase, open, activate, facilitate, enhance activation, sensitize, agonize, or up regulate activity of cancer biomarkers, e.g., agonists. Inhibitors, activators, or modulators also include genetically modified versions of cancer biomarkers, e.g., versions with altered activity, as well as naturally occurring and synthetic ligands, antagonists, agonists, antibodies, peptides, cyclic peptides, nucleic acids, antisense molecules, ribozymes, RNAi and siRNA molecules, small organic molecules and the like. Such assays for inhibitors and activators include, e.g., expressing cancer biomarkers in vitro, in cells, or cell extracts, applying putative modulator compounds, and then determining the functional effects on activity, as described above.
Samples or assays comprising cancer biomarkers that are treated with a potential activator, inhibitor, or modulator are compared to control samples without the inhibitor, activator, or modulator to examine the extent of inhibition. Control samples (untreated with inhibitors) are assigned a relative protein activity value of 100%. Inhibition of cancer biomarkers is achieved when the activity value relative to the control is about 80%, preferably 50%, more preferably 25-0%. Activation of cancer biomarkers is achieved when the activity value relative to the control (untreated with activators) is 110%, more preferably 150%, more preferably 200-500% (i.e., two to five fold higher relative to the control), more preferably 1000-3000% higher.
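By way of illustration only, the comparison to an untreated control described above can be sketched as follows. The function names are hypothetical, and reading “about 80%” as “80% or less” and “110%” as “110% or more” is an assumption made for the purpose of the sketch.

```python
def relative_activity(treated: float, control: float) -> float:
    """Activity of the treated sample as a percentage of the control (100%)."""
    if control <= 0:
        raise ValueError("control activity must be positive")
    return 100.0 * treated / control


def classify_modulation(treated: float, control: float) -> str:
    """Call inhibition or activation using the broadest thresholds in the text."""
    rel = relative_activity(treated, control)
    if rel <= 80.0:       # assumption: "about 80%" read as an upper bound
        return "inhibition"
    if rel >= 110.0:      # assumption: "110%" read as a lower bound
        return "activation"
    return "no call"


# Hypothetical raw activity readings (arbitrary units)
print(classify_modulation(40.0, 100.0))   # inhibition
print(classify_modulation(150.0, 100.0))  # activation
print(classify_modulation(95.0, 100.0))   # no call
```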
The term “test compound” or “drug candidate” or “modulator” or grammatical equivalents as used herein describes any molecule, either naturally occurring or synthetic, e.g., protein, oligopeptide (e.g., from about 5 to about 25 amino acids in length, preferably from about 10 to 20 or 12 to 18 amino acids in length, preferably 12, 15, or 18 amino acids in length), small organic molecule, polysaccharide, peptide, circular peptide, lipid, fatty acid, siRNA, polynucleotide, oligonucleotide, etc., to be tested for the capacity to directly or indirectly modulate cancer biomarkers. The test compound can be in the form of a library of test compounds, such as a combinatorial or randomized library that provides a sufficient range of diversity. Test compounds are optionally linked to a fusion partner, e.g., targeting compounds, rescue compounds, dimerization compounds, stabilizing compounds, addressable compounds, and other functional moieties. Conventionally, new chemical entities with useful properties are generated by identifying a test compound (called a “lead compound”) with some desirable property or activity, e.g., inhibiting activity, creating variants of the lead compound, and evaluating the property and activity of those variant compounds. Often, high throughput screening (HTS) methods are employed for such an analysis.
A “small organic molecule” refers to an organic molecule, either naturally occurring or synthetic, that has a molecular weight of more than about 50 daltons and less than about 2500 daltons, preferably less than about 2000 daltons, preferably between about 100 to about 1000 daltons, more preferably between about 200 to about 500 daltons.
The present invention provides methods of predicting or providing prognosis for lung cancer by detecting the expression of a panel of markers differentially expressed in the cancer. Examples of such panels would include some or all of the genes from ctnnb1, wnt3a, tp53, kras, erbb3, muc1, erbb2, and dusp6, or some or all of the genes rnd3, wnt3a, erbb3, lck, sh3bgr, fut3, il11, cdc6, cdk2ap1, bag1, emx2, six3, and brca1 (e.g., wnt3a, rnd3, lck, and erbb3). Prediction and prognosis involve determining the level of a panel of lung cancer biomarker polynucleotide or the corresponding polypeptides in a patient or patient sample and then comparing the level to a baseline or range. Typically, the baseline value is representative of levels of the polynucleotide or nucleic acid in a healthy person not suffering from, or destined to develop, lung cancer, as measured using a biological sample such as a lung biopsy or a sample of a bodily fluid. Variation of levels of a polynucleotide or corresponding polypeptides of the invention from the baseline range (either up or down) indicates that the patient has an increased risk of long term mortality.
In a preferred embodiment, real time or quantitative PCR is used to examine expression of the eight biomarkers in the panel using RNA from a biological sample such as tumor tissue. No microdissection is required. RNA extraction can be performed by any method known to those of skill in the art, e.g., using Trizol and RNeasy. Real time PCR can be performed by any method known to those of skill in the art, e.g., Taqman real time PCR using Applied Biosystems assays. Gene expression is calculated relative to pooled normal lung RNA, and expression is normalized to housekeeping genes. Suitable oligonucleotide primers are selected by those of skill in the art. In one embodiment, the assay is used for stage I, stage II, stage III, or stage IV cancers. In one embodiment, the tissue sample is from a surgically resected tumor.
In one embodiment, RNA biomarkers are examined using nucleic acid binding molecules such as probes, oligonucleotides, oligonucleotide arrays, and primers to detect differential RNA expression in patient samples. In one embodiment, RT-PCR is used according to standard methods known in the art. In another embodiment, quantitative PCR assays such as Taqman® assays available from, e.g., Applied Biosystems, can be used to detect nucleic acids and variants thereof. In other embodiments, nucleic acid microarrays can be used to detect nucleic acids. Analysis of nucleic acids can be achieved using routine techniques such as northern analysis; any other methods based on hybridization to a nucleic acid sequence that is complementary to a portion of the marker coding sequence (e.g., slot blot hybridization) are also within the scope of the present invention. Reagents that bind to selected nucleic acid biomarkers can be prepared according to methods known to those of skill in the art or purchased commercially.
Applicable PCR amplification techniques are described in, e.g., Ausubel et al. and Innis et al., supra. General nucleic acid hybridization methods are described in Anderson, “Nucleic Acid Hybridization,” BIOS Scientific Publishers, 1999. Amplification or hybridization of a plurality of nucleic acid sequences (e.g., genomic DNA, mRNA or cDNA) can also be performed from mRNA or cDNA sequences arranged in a microarray. Microarray methods are generally described in Hardiman, “Microarrays Methods and Applications: Nuts & Bolts,” DNA Press, 2003; and Baldi et al., “DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling,” Cambridge University Press, 2002.
Analysis of nucleic acid markers can be performed using techniques known in the art including, without limitation, sequence analysis, and electrophoretic analysis. Non-limiting examples of sequence analysis include Maxam-Gilbert sequencing, Sanger sequencing, capillary array DNA sequencing, thermal cycle sequencing (Sears et al., Biotechniques, 13:626-633 (1992)), solid-phase sequencing (Zimmerman et al., Methods Mol. Cell Biol., 3:39-42 (1992)), sequencing with mass spectrometry such as matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF/MS; Fu et al., Nat. Biotechnol., 16:381-384 (1998)), and sequencing by hybridization. Chee et al., Science, 274:610-614 (1996); Drmanac et al., Science, 260:1649-1652 (1993); Drmanac et al., Nat. Biotechnol., 16:54-58 (1998). Non-limiting examples of electrophoretic analysis include slab gel electrophoresis such as agarose or polyacrylamide gel electrophoresis, capillary electrophoresis, and denaturing gradient gel electrophoresis.
In another embodiment, antibody reagents can be used in assays to detect expression levels of protein biomarkers of the invention in patient samples using any of a number of immunoassays known to those skilled in the art. Immunoassay techniques and protocols are generally described in Price and Newman, “Principles and Practice of Immunoassay,” 2nd Edition, Grove's Dictionaries, 1997; and Gosling, “Immunoassays: A Practical Approach,” Oxford University Press, 2000. A variety of immunoassay techniques, including competitive and non-competitive immunoassays, can be used. See, e.g., Self et al., Curr. Opin. Biotechnol., 7:60-65 (1996). The term immunoassay encompasses techniques including, without limitation, enzyme immunoassays (EIA) such as enzyme multiplied immunoassay technique (EMIT), enzyme-linked immunosorbent assay (ELISA), IgM antibody capture ELISA (MAC ELISA), and microparticle enzyme immunoassay (MEIA); capillary electrophoresis immunoassays (CEIA); radioimmunoassays (RIA); immunoradiometric assays (IRMA); fluorescence polarization immunoassays (FPIA); and chemiluminescence assays (CL). If desired, such immunoassays can be automated. Immunoassays can also be used in conjunction with laser induced fluorescence. See, e.g., Schmalzing et al., Electrophoresis, 18:2184-93 (1997); Bao, J. Chromatogr. B. Biomed. Sci., 699:463-80 (1997). Liposome immunoassays, such as flow-injection liposome immunoassays and liposome immunosensors, are also suitable for use in the present invention. See, e.g., Rongen et al., J. Immunol. Methods, 204:105-133 (1997). In addition, nephelometry assays, in which the formation of protein/antibody complexes results in increased light scatter that is converted to a peak rate signal as a function of the marker concentration, are suitable for use in the methods of the present invention. Nephelometry assays are commercially available from Beckman Coulter (Brea, Calif.; Kit #449430) and can be performed using a Behring Nephelometer Analyzer (Fink et al., J. Clin. Chem. Clin. Biochem., 27:261-276 (1989)).
A detectable moiety can be used in the assays described herein (direct or indirect detection). A wide variety of detectable moieties can be used, with the choice of label depending on the sensitivity required, ease of conjugation with the antibody, stability requirements, and available instrumentation and disposal provisions. Suitable detectable moieties include, but are not limited to, radionuclides, fluorescent dyes (e.g., fluorescein, fluorescein isothiocyanate (FITC), Oregon Green™, rhodamine, Texas red, tetramethylrhodamine isothiocyanate (TRITC), Cy3, Cy5, etc.), fluorescent markers (e.g., green fluorescent protein (GFP), phycoerythrin, etc.), autoquenched fluorescent compounds that are activated by tumor-associated proteases, enzymes (e.g., luciferase, horseradish peroxidase, alkaline phosphatase, etc.), nanoparticles, biotin, digoxigenin, metals, and the like.
A chemiluminescence assay using a chemiluminescent antibody specific for the marker is suitable for sensitive, non-radioactive detection of protein levels. An antibody labeled with fluorochrome is also suitable. Examples of fluorochromes include, without limitation, DAPI, fluorescein, Hoechst 33258, R-phycocyanin, B-phycoerythrin, R-phycoerythrin, rhodamine, Texas red, and lissamine. Indirect labels include various enzymes well known in the art, such as horseradish peroxidase (HRP), alkaline phosphatase (AP), β-galactosidase, urease, and the like. A horseradish-peroxidase detection system can be used, for example, with the chromogenic substrate tetramethylbenzidine (TMB), which yields a soluble product in the presence of hydrogen peroxide that is detectable at 450 nm. An alkaline phosphatase detection system can be used with the chromogenic substrate p-nitrophenyl phosphate, for example, which yields a soluble product readily detectable at 405 nm. Similarly, a β-galactosidase detection system can be used with the chromogenic substrate o-nitrophenyl-β-galactopyranoside (ONPG), which yields a soluble product detectable at 410 nm. A urease detection system can be used with a substrate such as urea-bromocresol purple (Sigma Immunochemicals; St. Louis, Mo.).
A signal from the direct or indirect label can be analyzed, for example, using a spectrophotometer to detect color from a chromogenic substrate; a radiation counter to detect radiation such as a gamma counter for detection of 125I; or a fluorometer to detect fluorescence in the presence of light of a certain wavelength. For detection of enzyme-linked antibodies, a quantitative analysis can be made using a spectrophotometer such as an EMAX Microplate Reader (Molecular Devices; Menlo Park, Calif.) in accordance with the manufacturer's instructions. If desired, the assays of the present invention can be automated or performed robotically, and the signal from multiple samples can be detected simultaneously.
The antibodies can be immobilized onto a variety of solid supports, such as magnetic or chromatographic matrix particles, the surface of an assay plate (e.g., microtiter wells), pieces of a solid substrate material or membrane (e.g., plastic, nylon, paper), and the like. An assay strip can be prepared by coating the antibody or a plurality of antibodies in an array on a solid support. This strip can then be dipped into the test sample and processed quickly through washes and detection steps to generate a measurable signal, such as a colored spot.
Useful physical formats comprise surfaces having a plurality of discrete, addressable locations for the detection of a plurality of different markers. Such formats include microarrays and certain capillary devices. See, e.g., Ng et al., J. Cell Mol. Med., 6:329-340 (2002); U.S. Pat. No. 6,019,944. In these embodiments, each discrete surface location may comprise antibodies to immobilize one or more markers for detection at each location. Surfaces may alternatively comprise one or more discrete particles (e.g., microparticles or nanoparticles) immobilized at discrete locations of a surface, where the microparticles comprise antibodies to immobilize one or more markers for detection.
Analysis can be carried out in a variety of physical formats. For example, the use of microtiter plates or automation could be used to facilitate the processing of large numbers of test samples. Alternatively, single sample formats could be developed to facilitate diagnosis or prognosis in a timely fashion.
Alternatively, the antibodies or nucleic acid probes of the invention can be applied to sections of patient biopsies immobilized on microscope slides. The resulting antibody staining or in situ hybridization pattern can be visualized using any one of a variety of light or fluorescent microscopic methods known in the art.
In another format, the various markers of the invention also provide reagents for in vivo imaging such as, for instance, the imaging of labeled reagents that detect the nucleic acids or encoded proteins of the biomarkers of the invention. For in vivo imaging purposes, reagents that detect the presence of proteins encoded by cancer biomarkers, such as antibodies, may be labeled using an appropriate marker, such as a fluorescent marker.
In another aspect, the invention features a report indicating a prognosis of a subject with cancer. The report can, for example, be in electronic or paper form. The report can include basic patient information, including a subject identifier (e.g., the subject's name, a social security number, a medical insurance number, or a randomly generated number), physical characteristics of the subject (e.g., age, weight, or sex), the requesting physician's name, the date the prognosis was generated, and the date of sample collection. The reported prognosis can relate to likelihood of survival for a certain period of time, likelihood of response to certain treatments within a certain period of time (e.g., chemotherapeutic or surgical treatments), and/or likelihood of reoccurrence of cancer. The reported prognosis can be in the form of a percentage chance of survival for a certain period of time, percentage chance of favorable response to treatment (favorable response can be defined as, e.g., tumor shrinkage or slowing of tumor growth), or reoccurrence over a defined period of time (e.g., 20% chance of survival over a five year period). The reported prognosis can alternatively be in the form of a calculated score. A greater or lower score, for example, can be indicative of a favorable prognosis. In another embodiment, the reported prognosis can be a general description of the likelihood of survival, response to treatment, or reoccurrence over a period of time (e.g., very likely, likely, or unlikely to survive for five years). In another embodiment, the reported prognosis can be in the form of a graph. In addition to the gene expression levels, the reported prognosis may also take into account additional characteristics of the subject (e.g., age, stage of cancer, gender, previous treatment, fitness, cardiovascular health, and mental health).
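By way of illustration only, an electronic form of such a report could be represented as a simple record. The following is a minimal sketch; the class name `PrognosisReport`, its field names, and the example values are hypothetical and are not prescribed by the invention.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass
class PrognosisReport:
    """Hypothetical container for the report items listed in the text."""
    subject_id: str                      # e.g., a randomly generated number
    requesting_physician: str
    sample_collected: date
    prognosis_generated: date
    age_years: Optional[int] = None
    sex: Optional[str] = None
    five_year_survival_pct: Optional[float] = None   # percentage form
    risk_score: Optional[float] = None               # calculated score form
    description: Optional[str] = None                # e.g., "likely"
    expression_levels: Dict[str, float] = field(default_factory=dict)

    def summary(self) -> str:
        """One-line text rendering of whichever prognosis forms are present."""
        parts = [f"Subject {self.subject_id}"]
        if self.five_year_survival_pct is not None:
            parts.append(
                f"{self.five_year_survival_pct:.0f}% chance of survival "
                "over five years"
            )
        if self.risk_score is not None:
            parts.append(f"risk score {self.risk_score:.2f}")
        if self.description:
            parts.append(self.description)
        return "; ".join(parts)


# Hypothetical example values
report = PrognosisReport(
    subject_id="R-0001",
    requesting_physician="Dr. Example",
    sample_collected=date(2020, 1, 2),
    prognosis_generated=date(2020, 1, 9),
    five_year_survival_pct=20.0,
)
print(report.summary())
```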
In addition to a prognosis, the report can optionally include raw data concerning the expression level of at least two genes selected from Rnd3, wnt3a, erbb3, lck, sh3bgr, fut3, il11, cdc6, cdk2ap1, bag1, emx2, six3, and brca1.
The invention provides compositions, kits and integrated systems for practicing the assays described herein using antibodies specific for the polypeptides or nucleic acids specific for the polynucleotides of the invention.
Kits for carrying out the diagnostic assays of the invention typically include a probe that comprises an antibody or nucleic acid sequence that specifically binds to polypeptides or polynucleotides of the invention, and a label for detecting the presence of the probe. The kits may include several antibodies or polynucleotide sequences encoding polypeptides of the invention, e.g., a cocktail of antibodies that recognize the proteins encoded by the biomarkers of the invention.
The following examples are offered to illustrate, but not to limit the claimed invention.
The present invention analyzed data on 105 patients with surgically resected lung adenocarcinoma who had at least 3 years of follow up and had high quality tissue accessible. Real-time PCR was run on 65 genes and housekeeping genes (18S and POL2RA, see also Table 1). PCR output (in cycle thresholds, CT) was normalized to commercially available pooled normal lung RNA and 18S as the housekeeping gene. The resulting values (as shown in Table 2) are proportion of expression relative to the normal lung RNA. In Table 2, the genes are arranged in rows, and the genes are abbreviated. The number before each gene (e.g. 7-FZD) is a number we used to keep track of genes and has no other bearing. These values were then log transformed (the second set of values in Table 2).
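By way of illustration only, the normalization described above can be sketched with the widely used comparative CT (2^-ΔΔCT) calculation. The text does not state the exact formula or the base of the log transform, so the use of the comparative CT method, the assumption of roughly 100% amplification efficiency, the base-2 logarithm, and the function names are all assumptions of this sketch.

```python
import math


def relative_expression(
    ct_gene_sample: float,
    ct_ref_sample: float,
    ct_gene_normal: float,
    ct_ref_normal: float,
) -> float:
    """Expression of a gene in a tumor sample as a proportion of its
    expression in pooled normal lung RNA, normalized to a housekeeping
    gene such as 18S (comparative CT method; an assumption)."""
    delta_sample = ct_gene_sample - ct_ref_sample    # normalize to housekeeping
    delta_normal = ct_gene_normal - ct_ref_normal    # same for the calibrator
    delta_delta = delta_sample - delta_normal
    return 2.0 ** (-delta_delta)                     # assumes ~100% efficiency


def log_relative_expression(*cts: float) -> float:
    """Log-transformed relative expression (base 2 chosen for illustration)."""
    return math.log2(relative_expression(*cts))


# Hypothetical CT values: the gene crosses threshold two cycles earlier in
# tumor than in normal lung, with identical 18S CT values.
print(relative_expression(24.0, 10.0, 26.0, 10.0))      # 4.0
print(log_relative_expression(24.0, 10.0, 26.0, 10.0))  # 2.0
```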
The analysis was done as follows. The samples (patients) were divided randomly into 2 sets: a training set (n=70) and a validation set (n=35). Cox modeling was performed on each gene individually in the training set, and a liberal P value threshold of 0.2 for survival was used to select genes. We then did backwards selection of all these genes together on the training set, that is, dropping genes one-by-one based on P value criteria of 0.2, and stopping when dropping that gene had a meaningful change in one or more of the remaining genes' coefficients. The final model contained the following eight genes: ctnnb1, wnt3a, tp53, kras, erbb3, muc1, erbb2, and dusp6. Individual sample expression values were then plugged into the model to arrive at a score for each sample. We chose to set the cutpoint for a high score as the top tertile of scores. Kaplan-Meier curves were generated for all samples and also for stage I patients only. This was repeated in the validation set. A Cox model was run including tumor size and tumor stage, and a high score predicted a higher risk of death after adjusting for these other variables.
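By way of illustration only, the scoring step described above can be sketched as a linear combination of log expression values and model coefficients, followed by a top-tertile cutpoint. The coefficient values below are placeholders and are not the fitted coefficients of the model; the function names and the choice of the standard library quantile method are assumptions of this sketch.

```python
import statistics
from typing import Dict, List

PANEL = ("ctnnb1", "wnt3a", "tp53", "kras", "erbb3", "muc1", "erbb2", "dusp6")


def risk_score(log_expr: Dict[str, float], coefs: Dict[str, float]) -> float:
    """Linear predictor of a Cox-type model: sum of coefficient x expression."""
    return sum(coefs[g] * log_expr[g] for g in PANEL)


def top_tertile_cutpoint(training_scores: List[float]) -> float:
    """Upper of the two cut points that divide the scores into thirds."""
    return statistics.quantiles(training_scores, n=3)[1]


def is_high_score(score: float, cutpoint: float) -> bool:
    return score > cutpoint


# Placeholder coefficients and expression values (hypothetical)
coefs = {g: 0.5 for g in PANEL}
sample = {g: 1.0 for g in PANEL}
training_scores = [float(i) for i in range(9)]

cut = top_tertile_cutpoint(training_scores)
print(risk_score(sample, coefs), is_high_score(risk_score(sample, coefs), cut))
```

In this sketch the cutpoint is derived from the training scores only and then reused unchanged on any other sample, which mirrors the separation of training and validation sets described above.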
Ideally, a prognostic tool should provide accurate risk stratification, should be clinically feasible to employ in day-to-day practice, and should be cost effective. Such an assay would be of particular benefit to patients with surgically resected stage I NSCLC. The current standard of care for stage I NSCLC is lobectomy and mediastinal lymph node dissection, without adjuvant chemotherapy. Better identification of good prognosis patient subsets might allow lesser surgical procedures to be employed with equal survival potential. Conversely, stage I subsets with a poor prognosis could be selected for inclusion into clinical trials testing novel approaches and new therapeutic agents. Considering the current limitations of chemotherapy in stage I disease, a bioassay that is both prognostic and predictive of chemotherapy benefit would be especially beneficial. Lastly, stage I NSCLC is likely to be of increasing importance in the future. While approximately twenty percent of patients currently diagnosed with NSCLC are stage I, this proportion probably will grow due to the recent advent of lung cancer screening by computerized tomography.
We have found that the expression of Wnt3a, rnd3, lck, and erbb3 is prognostic of survival in lung cancer patients. Clinical and histologic characteristics of patients studied are listed in Table 3. We first analyzed the entire data set, and subsequently restricted the analysis to stage I patients. Cross-validation, after forced adjustment for age, tumor size, and stage, supported a model containing 4 genes, selected in the following order: Wnt3a, rnd3, lck, and erbb3. Overall survival (OS) was analyzed utilizing the risk score for the model utilizing these four genes (with L1-shrunken coefficients). After adjusting for patient age, disease stage, and tumor size, the hazard ratio (HR) for the risk score as a continuous variable was 6.7 (95% CI: 1.6-28.9, p=0.005), as listed in Table 4.
Survival curves for the entire study cohort based on dichotomized risk scores from the 4 gene model are presented in
Applying the 5-gene model of Chen and colleagues to our study population, the HR adjusted for age, stage, and tumor size was 2.4 (95% CI: 1.3-4.3, p=0.0047) (Chen et al., N Engl J Med 356:11-20 (2007)). Among all patients, 5-year OS was 65% among low-risk and 43% among high-risk patients (p=0.045). Among stage I patients, 5-year OS was 80% among low-risk and 47% among high-risk patients (p=0.022).
We identified a risk score based on the gene expression of four genes by quantitative PCR that is prognostic of long-term survival in patients with completely surgically resected lung adenocarcinoma. This gene expression-based score predicted survival (adjusted HR 6.7) and disease recurrence better than clinical stage and tumor size. Our model best predicted risk of death among patients with stage I disease. In our cross-validated cohort, low- and high-risk scores were associated with 5-year overall survival of 87% and 38%, respectively, among stage I patients.
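By way of illustration only, survival proportions of the kind quoted above are conventionally read from Kaplan-Meier curves, which the analysis generated for each risk group. The following is a minimal sketch of the product-limit estimator; the function name and the follow-up data are hypothetical, and no particular statistical software is implied.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Product-limit survival estimate.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if censored
    Returns (time, survival probability) at each time with a death.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up in years; the second patient is censored
print(kaplan_meier([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1]))
```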
We utilized cross-validation both for model selection and for model validation. Rather than splitting data into test and validation sets, cross-validation utilizes repeated data-splitting to prevent model over-fitting and to generate accurate estimates of model coefficients. Cross-validation allows for the generation of validated model coefficients, but utilizes data more efficiently than simple data-splitting.
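By way of illustration only, the repeated data-splitting that underlies cross-validation can be sketched as follows. The number of folds, the random seed, and the function name are assumptions of this sketch; the text does not specify how the folds were constructed.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, held-out indices) for each of k folds.

    Every sample is held out exactly once, so all data contribute to both
    model fitting and model assessment.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, held_out


# Hypothetical cohort of 10 samples split into 5 folds
for training, held_out in k_fold_indices(10, 5):
    print(sorted(held_out), len(training))
```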
This prognostic model has important clinical implications for diagnosis and providing a prognosis for lung cancer, e.g., as a tool to complement clinical staging in risk-stratifying patients with stage I lung adenocarcinoma. The current standard of care for patients with stage I NSCLC is surgical resection alone, typically lobectomy. Conversely, post-resection patients at low risk for recurrence may be best served by observation alone, and may even be candidates for less radical lung resection.
In our patient population, the prognostic value of our model compares favorably to that of Chen and colleagues (Id.). There are several possible explanations for this finding. First, candidate genes for our model were selected based on previously published genome-wide expression microarray studies, whereas genes in the Chen model were selected from a more limited custom expression array platform. Additionally, a relatively large set of genes were assessed by RT-PCR before finalizing our gene model. Moreover, our patient follow-up time was relatively long, with a minimum follow-up period of 3 years and a median follow-up of 61 months. This factor may have allowed us to more accurately detect differences in survival between risk groups.
Between January 1997 and June 2004, 120 patients who had undergone complete surgical resection of lung adenocarcinoma, and had fresh-frozen tissue banked for genomic analysis, were entered into the study. Eligible patients had undergone surgical resection with curative intent and had adequate mediastinal lymph node staging. Patients who received pre-operative chemotherapy were excluded from the study, so as not to confound development of a purely prognostic tool. Thirteen patients were excluded due to insufficient banked tissue, inadequate RNA quality, or weak expression of housekeeping genes. The primary endpoint was overall survival. Disease-free survival (DFS) was defined as the time from surgery until radiographic evidence of recurrent disease or time until the last documented physician follow-up visit in the absence of recurrent disease. Patients consented to tissue specimen collection prospectively, and the study was approved by the UCSF institutional review board (CHR #H8714-28880-01).
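The DFS definition above can be expressed as a time plus an event indicator, with patients who have no documented recurrence treated as censored at their last follow-up visit. The sketch below is illustrative only; the function name `disease_free_survival`, the use of days as the time unit, and the example dates are assumptions.

```python
from datetime import date


def disease_free_survival(surgery, recurrence, last_follow_up):
    """Return (days, event) for one patient.

    event is True when radiographic recurrence was observed and False
    when the patient is censored at the last documented follow-up visit.
    Hypothetical helper; days are an assumed time unit.
    """
    if recurrence is not None:
        return (recurrence - surgery).days, True
    return (last_follow_up - surgery).days, False


# Made-up dates: one patient with recurrence, one censored patient.
recurred = disease_free_survival(date(2000, 1, 10), date(2001, 1, 10), date(2002, 3, 1))
censored = disease_free_survival(date(2000, 1, 10), None, date(2000, 1, 20))
```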
Of 217 genes identified from previously published microarray and PCR-based studies of prognosis in early stage lung cancer (see, for example, Beer et al. Nat Med 8:816 (2002); Potti et al. N Engl J Med 355:570 (2006); Wigle et al. Cancer Res 62:3005 (2002); Garber et al. Proc Natl Acad Sci USA 98:13784 (2001); Miura et al. Cancer Res 62:3244 (2002); and Schneider et al. Br J Cancer 83:473 (2000)), 76 cancer-related genes were identified by content experts or literature review. Fifteen genes were excluded due to non-functioning primers or very weak expression in tumor tissue by RT-PCR in a pilot study. The remaining 61 genes are listed in Table 5.
All tissue was frozen in liquid nitrogen at the time of the operation. Tissue was macrodissected into 1 cm³ sections, which were ground in liquid nitrogen. RNA was extracted using a TriZol (Invitrogen, Carlsbad, Calif.) extraction protocol. Taqman RT-PCR was performed on cDNA in 384-well plates using a Prism 7900HT machine (Applied Biosystems, Foster City, Calif.). The expression of each gene was assayed in triplicate. Samples were compared to commercially available pooled normal lung RNA (Clontech, Mountain View, Calif.), and normalized to 18S ribosomal RNA (Applied Biosystems).
Approximately 1 cm-cube blocks of tumor specimen were obtained from fresh lung resections and immediately snap-frozen in liquid nitrogen. Frozen tissues were stored at −80° C. Total RNA was extracted from 131 individual tumor specimens using a TriZol (Invitrogen) extraction protocol. Quality of RNA samples was verified using an Agilent 2100 BioAnalyzer™ (Agilent), and the RNA concentration of each sample was determined by Nanodrop ND-1000™ (Nanodrop). First-strand cDNA was synthesized from 3 μg of total RNA template using the iScript cDNA Synthesis Kit (BioRad), and the expression levels corresponding to each of 86 different gene transcripts were measured by real-time quantitative PCR using an ABI PRISM 7900 HT Sequence Detection System with automation accessory (Applied Biosystems). Each 20 μL reaction, performed in triplicate, consisted of TaqMan Universal PCR Master Mix (Applied Biosystems), the appropriate Taqman Gene Expression Assay (Applied Biosystems), and 1.8 μL cDNA template corresponding to 120 ng of total RNA per reaction.
Ct values were obtained using the Sequence Detection System (SDS) 2.3 software (Applied Biosystems) and relative gene expression values were calculated as ΔΔCt, which yields the relative expression level of a target gene normalized to that of an endogenous reference gene relative to a calibrator sample (the reference for all samples). The calibrator sample used was cDNA synthesized from normal human lung total RNA supplied by Clontech (Catalog #636524). To select an endogenous reference gene, the expression levels of a number of commonly used, previously published reference genes were compared in lung tissue (both tumor and normal). Three genes (encoding the 18S ribosomal RNA subunit, POL2RA RNA polymerase, and TBP TATA box protein) were selected and their expression levels were compared across dilutions of several cDNA specimens from both tumor and normal samples. 18S rRNA exhibited the most stable expression across lung samples (tumor and normal), and therefore raw gene expression values were normalized to that of 18S rRNA.
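The ΔΔCt calculation described above can be illustrated with a short sketch. It averages triplicate Ct values, subtracts the reference-gene Ct (18S rRNA in the text) from the target-gene Ct, and then subtracts the same quantity for the calibrator sample. The function names and all Ct values are hypothetical, and the fold-change conversion 2^(−ΔΔCt) assumes roughly 100% amplification efficiency, which the text does not state.

```python
from statistics import mean


def delta_ct(target_cts, reference_cts):
    """Mean target-gene Ct minus mean reference-gene Ct (replicates averaged)."""
    return mean(target_cts) - mean(reference_cts)


def delta_delta_ct(sample_target, sample_ref, calib_target, calib_ref):
    """Sample delta-Ct minus calibrator delta-Ct."""
    return delta_ct(sample_target, sample_ref) - delta_ct(calib_target, calib_ref)


def relative_expression(ddct):
    """Fold change versus the calibrator, assuming ~100% PCR efficiency."""
    return 2.0 ** (-ddct)


# Made-up triplicate Ct values for one target gene and the 18S reference.
tumor_target, tumor_ref = [24.0, 24.5, 23.5], [10.0, 10.25, 9.75]
calib_target, calib_ref = [26.0, 26.5, 25.5], [10.0, 10.0, 10.0]

ddct = delta_delta_ct(tumor_target, tumor_ref, calib_target, calib_ref)
fold = relative_expression(ddct)
```

With these made-up inputs the tumor sample reaches threshold two cycles earlier than the calibrator after normalization, so ΔΔCt is about −2 and the relative expression is about fourfold.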
The salient features of our data structure are (i) a right-censored survival endpoint (death) with modest event numbers (47, after subsetting to exclude missing values); (ii) a multitude (61) of predictors as constituted by the (log) gene expression values obtained from RT-QPCR (delta-delta Cts described elsewhere); and (iii) select clinical and demographic covariates. In view of these features and dimensions, the primary data analytic tool we employed was L1 penalized Cox proportional hazards regression. This methodology extends the simultaneous coefficient shrinkage and predictor selection that is inherent in L1 penalization, where it has proven highly effective, to the survival data setting. Earlier extensions, largely motivated by microarray gene expression applications, were either computationally prohibitive or reliant on approximation. We used cross-validation to determine model size; i.e., number of selected genes. A risk score was then generated for each subject based on model coefficients. Resultant predicted risk scores were dichotomized (at the median) and corresponding Kaplan-Meier survival curves displayed and compared via the log-rank statistic. All analyses were conducted using the statistical package R (Version 2.3.1, 2006).
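The downstream steps described above (risk score, median split, Kaplan-Meier estimate) can be sketched as follows. This is not the analysis code used in the study, which was run in R, and it does not reproduce the L1 penalized Cox fit itself. The sketch assumes the risk score is the usual Cox linear predictor (sum of coefficient times log expression); the gene labels, coefficient values, and follow-up data are hypothetical.

```python
from statistics import median


def risk_score(log_expression, coefficients):
    """Linear predictor: sum of coefficient * log expression over selected genes."""
    return sum(coefficients[g] * log_expression[g] for g in coefficients)


def dichotomize_at_median(scores):
    """Label each score 'high' if above the median score, otherwise 'low'."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    events[i] is True for a death and False for right-censoring.
    """
    survival, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical coefficients for two placeholder genes (not the study's model).
coefs = {"gene_a": 0.5, "gene_b": -1.0}
score = risk_score({"gene_a": 2.0, "gene_b": 1.0, "gene_c": 9.0}, coefs)
groups = dichotomize_at_median([0.1, 0.9, 0.4, 0.7])

# Made-up follow-up times in months with death indicators.
curve = kaplan_meier([6, 12, 12, 20, 30], [True, False, True, True, False])
```

In practice a curve would be estimated separately for the low-risk and high-risk groups and the two compared with a log-rank test, which is omitted here for brevity.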
It is understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and scope of the appended claims. All publications, patents, and patent applications cited herein are hereby incorporated by reference in their entirety for all purposes.
¹HR = hazard ratio; CI = confidence interval
²Risk score analyzed as a continuous variable
³HR for stage compares stage I patients with all other patients
The present application claims priority to U.S. Ser. No. 60/941,550, filed Jun. 1, 2007, herein incorporated by reference in its entirety.
Number | Date | Country |
---|---|---|
60941550 | Jun 2007 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 12602654 | Jun 2010 | US |
Child | 13668192 | | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 13668192 | Nov 2012 | US |
Child | 15453864 | | US |