The present invention relates to compositions and methods for cancer diagnosis, research and therapy, including but not limited to, cancer markers. In particular, the present invention relates to pseudogenes as diagnostic markers and clinical targets for cancer.
Afflicting one out of nine men over age 65, prostate cancer (PCA) is a leading cause of male cancer-related death, second only to lung cancer (Abate-Shen and Shen, Genes Dev 14:2410 [2000]; Ruijter et al., Endocr Rev, 20:22 [1999]). The American Cancer Society estimates that about 184,500 American men will be diagnosed with prostate cancer and 39,200 will die in 2001.
Prostate cancer is typically diagnosed with a digital rectal exam and/or prostate specific antigen (PSA) screening. An elevated serum PSA level can indicate the presence of PCA. PSA is used as a marker for prostate cancer because it is secreted only by prostate cells. A healthy prostate will produce a stable amount—typically below 4 nanograms per milliliter, or a PSA reading of “4” or less—whereas cancer cells produce escalating amounts that correspond with the severity of the cancer. A level between 4 and 10 may raise a doctor's suspicion that a patient has prostate cancer, while amounts above 50 may show that the tumor has spread elsewhere in the body.
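The PSA ranges described above can be sketched as a small lookup. This is an illustrative sketch only, not a clinical rule: the function name `psa_band` and the band labels are hypothetical, and the label used for readings above 10 and up to 50 ng/ml is an assumption, since the text does not characterize that range.

```python
def psa_band(ng_per_ml: float) -> str:
    """Map a serum PSA reading (ng/ml) to the coarse bands described in the text."""
    if ng_per_ml < 0:
        raise ValueError("PSA level cannot be negative")
    if ng_per_ml <= 4:
        return "typical"          # a reading of 4 or less
    if ng_per_ml <= 10:
        return "intermediate"     # 4-10: may raise suspicion of prostate cancer
    if ng_per_ml > 50:
        return "possible spread"  # above 50: tumor may have spread
    return "elevated"             # 10-50: not characterized in the text (assumed label)
```

For example, `psa_band(7.2)` falls in the intermediate band, the range in which the serum PSA test is least specific.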
When PSA or digital tests indicate a strong likelihood that cancer is present, a transrectal ultrasound (TRUS) is used to map the prostate and show any suspicious areas. Biopsies of various sectors of the prostate are used to determine if prostate cancer is present. Treatment options depend on the stage of the cancer. Men with a 10-year life expectancy or less who have a low Gleason number and whose tumor has not spread beyond the prostate are often treated with watchful waiting (no treatment). Treatment options for more aggressive cancers include surgical treatments such as radical prostatectomy (RP), in which the prostate is completely removed (with or without nerve sparing techniques), and radiation, applied through an external beam that directs the dose to the prostate from outside the body or via low-dose radioactive seeds that are implanted within the prostate to kill cancer cells locally. Anti-androgen hormone therapy is also used, alone or in conjunction with surgery or radiation. Hormone therapy uses luteinizing hormone-releasing hormone (LH-RH) analogs, which block the pituitary from producing hormones that stimulate testosterone production. Patients must have injections of LH-RH analogs for the rest of their lives.
While surgical and hormonal treatments are often effective for localized PCA, advanced disease remains essentially incurable. Androgen ablation is the most common therapy for advanced PCA, leading to massive apoptosis of androgen-dependent malignant cells and temporary tumor regression. In most cases, however, the tumor reemerges with a vengeance and can proliferate independent of androgen signals.
The advent of prostate specific antigen (PSA) screening has led to earlier detection of PCA and significantly reduced PCA-associated fatalities. However, the impact of PSA screening on cancer-specific mortality is still unknown pending the results of prospective randomized screening studies (Etzioni et al., J. Natl. Cancer Inst., 91:1033 [1999]; Maattanen et al., Br. J. Cancer 79:1210 [1999]; Schroder et al., J. Natl. Cancer Inst., 90:1817 [1998]). A major limitation of the serum PSA test is a lack of prostate cancer sensitivity and specificity, especially in the intermediate range of PSA detection (4-10 ng/ml). Elevated serum PSA levels are often detected in patients with non-malignant conditions such as benign prostatic hyperplasia (BPH) and prostatitis, and provide little information about the aggressiveness of the cancer detected. Coincident with increased serum PSA testing, there has been a dramatic increase in the number of prostate needle biopsies performed (Jacobsen et al., JAMA 274:1445 [1995]). This has resulted in a surge of equivocal prostate needle biopsies (Epstein and Potter, J. Urol., 166:402 [2001]). Thus, development of additional serum and tissue biomarkers to supplement PSA screening is needed.
Embodiments of the present invention provide compositions, kits, and methods useful in the detection and screening of prostate and breast cancer.
For example, embodiments of the present invention provide a method of screening for the presence of breast cancer in a subject, comprising contacting a biological sample from a subject with a reagent for detecting the level of expression of a pseudogene (e.g., ATPase, aminophospholipid transporter, class I, type 8A, member 2 pseudogene (ATP8A2-Ψ) or dipeptidyl-peptidase 3 (DPP3)); and detecting the level of expression of the pseudogene in the sample using an in vitro assay, wherein an increased level of expression of the pseudogene in the sample relative to the level in normal breast cells is indicative of breast cancer in the subject. In some embodiments, the sample is, for example, blood, plasma, serum or breast cells. In some embodiments, detection is carried out utilizing a method selected from, for example, a sequencing technique, a nucleic acid hybridization technique, a nucleic acid amplification technique (e.g., polymerase chain reaction, reverse transcription polymerase chain reaction, transcription-mediated amplification, ligase chain reaction, strand displacement amplification or nucleic acid sequence based amplification) or an immunoassay. In some embodiments, the reagent is, for example, a pair of amplification oligonucleotides or an oligonucleotide probe. In some embodiments, the breast cancer is luminal breast cancer.
In some embodiments, the present invention provides a method of screening for the presence of prostate cancer in a subject, comprising contacting a biological sample from a subject with a reagent for detecting the level of expression of a pseudogene (e.g., coxsackie virus and adenovirus receptor pseudogene (CXADR-Ψ), NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 9 (NDUFA9), epithelial cell adhesion molecule (EPCAM), PDGFA associated protein 1 (PDAP1), RNA binding motif protein 17 (RBM17), carboxylesterase 5A (CES7) or kallikrein-related peptidase 4-kallikrein pseudogene 1 (KLK4-KLKP1)); and detecting the level or presence of expression of the pseudogene in the sample using an in vitro assay, wherein the presence or an increased level of expression of the pseudogene in the sample relative to the level in normal prostate cells is indicative of prostate cancer in the subject. In some embodiments, the sample is, for example, tissue, blood, plasma, serum, urine, urine supernatant, urine cell pellet, semen, prostatic secretions or prostate cells. In some embodiments, detection is carried out utilizing a method selected from, for example, a sequencing technique, a nucleic acid hybridization technique, a nucleic acid amplification technique (e.g., polymerase chain reaction, reverse transcription polymerase chain reaction, transcription-mediated amplification, ligase chain reaction, strand displacement amplification or nucleic acid sequence based amplification) or an immunoassay. In some embodiments, the reagent is, for example, a pair of amplification oligonucleotides or an oligonucleotide probe. In some embodiments, the cancer is localized prostate cancer or metastatic prostate cancer.
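The comparison at the core of these screening methods, expression in the sample relative to the level in normal cells, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `is_elevated` and the two-fold default threshold are hypothetical and are not specified in the text, and the expression values are assumed to be already normalized by whatever in vitro assay produced them.

```python
from statistics import mean

def is_elevated(sample_level, normal_levels, fold_threshold=2.0):
    """Return True if the sample's pseudogene expression exceeds the normal baseline.

    sample_level   -- normalized expression measured in the subject's sample
    normal_levels  -- normalized expression values from normal reference cells
    fold_threshold -- hypothetical cutoff; the text does not specify one
    """
    if not normal_levels:
        raise ValueError("at least one normal reference value is required")
    baseline = mean(normal_levels)
    if baseline == 0:
        # No expression in normal cells: mere presence in the sample counts.
        return sample_level > 0
    return sample_level / baseline >= fold_threshold
```

The zero-baseline branch mirrors the "presence or an increased level" wording: when normal cells show no expression, any detectable expression in the sample is treated as a positive call.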
Additional embodiments are described herein.
To facilitate an understanding of the present invention, a number of terms and phrases are defined below:
As used herein, the terms “detect”, “detecting” or “detection” may describe either the general act of discovering or discerning or the specific observation of a detectably labeled composition.
As used herein, the term “subject” refers to any organism that is screened using the diagnostic methods described herein. Such organisms preferably include, but are not limited to, mammals (e.g., murines, simians, equines, bovines, porcines, canines, felines, and the like), and most preferably include humans.
The term “diagnosed,” as used herein, refers to the recognition of a disease by its signs and symptoms, or genetic analysis, pathological analysis, histological analysis, and the like.
A “subject suspected of having cancer” encompasses an individual who has received an initial diagnosis (e.g., a CT scan showing a mass or increased PSA level) but for whom the stage of cancer or presence or absence of pseudogenes indicative of cancer is not known. The term further includes people who once had cancer (e.g., an individual in remission). In some embodiments, “subjects” are control subjects that are suspected of having cancer or diagnosed with cancer.
As used herein, the term “characterizing cancer in a subject” refers to the identification of one or more properties of a cancer sample in a subject, including but not limited to, the presence of benign, pre-cancerous or cancerous tissue, the stage of the cancer, and the subject's prognosis. Cancers may be characterized by the identification of the expression of one or more cancer marker genes, including but not limited to, the pseudogenes disclosed herein.
As used herein, the term “characterizing prostate tissue in a subject” refers to the identification of one or more properties of a prostate tissue sample (e.g., including but not limited to, the presence of cancerous tissue, the presence or absence of pseudogenes, the presence of pre-cancerous tissue that is likely to become cancerous, and the presence of cancerous tissue that is likely to metastasize). In some embodiments, tissues are characterized by the identification of the expression of one or more cancer marker genes, including but not limited to, the cancer markers disclosed herein.
As used herein, the term “stage of cancer” refers to a qualitative or quantitative assessment of the level of advancement of a cancer. Criteria used to determine the stage of a cancer include, but are not limited to, the size of the tumor and the extent of metastases (e.g., localized or distant).
As used herein, the term “nucleic acid molecule” refers to any nucleic acid-containing molecule, including but not limited to, DNA or RNA. The term encompasses sequences that include any of the known base analogs of DNA and RNA including, but not limited to, 4-acetylcytosine, 8-hydroxy-N6-methyladenosine, aziridinylcytosine, pseudoisocytosine, 5-(carboxyhydroxylmethyl) uracil, 5-fluorouracil, 5-bromouracil, 5-carboxymethylaminomethyl-2-thiouracil, 5-carboxymethyl-aminomethyluracil, dihydrouracil, inosine, N6-isopentenyladenine, 1-methyladenine, 1-methylpseudouracil, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-methyladenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine, 5′-methoxycarbonylmethyluracil, 5-methoxyuracil, 2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid, oxybutoxosine, pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil, 5-methyluracil, N-uracil-5-oxyacetic acid methylester, and 2,6-diaminopurine.
The term “gene” refers to a nucleic acid (e.g., DNA) sequence that comprises coding sequences necessary for the production of a polypeptide, precursor, or RNA (e.g., rRNA, tRNA). The polypeptide can be encoded by a full length coding sequence or by any portion of the coding sequence so long as the desired activity or functional properties (e.g., enzymatic activity, ligand binding, signal transduction, immunogenicity, etc.) of the full-length or fragments are retained. The term also encompasses the coding region of a structural gene and the sequences located adjacent to the coding region on both the 5′ and 3′ ends for a distance of about 1 kb or more on either end such that the gene corresponds to the length of the full-length mRNA. Sequences located 5′ of the coding region and present on the mRNA are referred to as 5′ non-translated sequences. Sequences located 3′ or downstream of the coding region and present on the mRNA are referred to as 3′ non-translated sequences. The term “gene” encompasses both cDNA and genomic forms of a gene. A genomic form or clone of a gene contains the coding region interrupted with non-coding sequences termed “introns” or “intervening regions” or “intervening sequences.” Introns are segments of a gene that are transcribed into nuclear RNA (hnRNA); introns may contain regulatory elements such as enhancers. Introns are removed or “spliced out” from the nuclear or primary transcript; introns therefore are absent in the messenger RNA (mRNA) transcript. The mRNA functions during translation to specify the sequence or order of amino acids in a nascent polypeptide.
As used herein, the term “oligonucleotide,” refers to a short length of single-stranded polynucleotide chain. Oligonucleotides are typically less than 200 residues long (e.g., between 15 and 100), however, as used herein, the term is also intended to encompass longer polynucleotide chains. Oligonucleotides are often referred to by their length. For example a 24 residue oligonucleotide is referred to as a “24-mer”. Oligonucleotides can form secondary and tertiary structures by self-hybridizing or by hybridizing to other polynucleotides. Such structures can include, but are not limited to, duplexes, hairpins, cruciforms, bends, and triplexes.
As used herein, the terms “complementary” or “complementarity” are used in reference to polynucleotides (i.e., a sequence of nucleotides) related by the base-pairing rules. For example, the sequence “5′-A-G-T-3′,” is complementary to the sequence “3′-T-C-A-5′.” Complementarity may be “partial,” in which only some of the nucleic acids' bases are matched according to the base pairing rules. Or, there may be “complete” or “total” complementarity between the nucleic acids. The degree of complementarity between nucleic acid strands has significant effects on the efficiency and strength of hybridization between nucleic acid strands. This is of particular importance in amplification reactions, as well as detection methods that depend upon binding between nucleic acids.
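The base-pairing rules and the distinction between partial and complete complementarity can be illustrated with a short sketch. The function names are hypothetical, only the four standard DNA bases are handled, and the second strand is written 3′ to 5′ so that position i of one strand pairs with position i of the other, as in the “5′-A-G-T-3′” / “3′-T-C-A-5′” example.

```python
# Watson-Crick pairing for the four standard DNA bases.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement_3to5(seq_5to3: str) -> str:
    """Return the complementary strand, written 3' to 5', for a 5'-to-3' sequence."""
    return "".join(PAIRS[base] for base in seq_5to3.upper())

def complementarity(strand_5to3: str, other_3to5: str) -> float:
    """Fraction of aligned positions that obey the base-pairing rules.

    1.0 corresponds to complete complementarity; anything lower is partial.
    """
    if not strand_5to3 or len(strand_5to3) != len(other_3to5):
        raise ValueError("strands must be non-empty and of equal length")
    pairs = zip(strand_5to3.upper(), other_3to5.upper())
    matches = sum(PAIRS[a] == b for a, b in pairs)
    return matches / len(strand_5to3)
```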
The term “homology” refers to a degree of complementarity. There may be partial homology or complete homology (i.e., identity). A partially complementary sequence is a nucleic acid molecule that at least partially inhibits a completely complementary nucleic acid molecule from hybridizing to a target nucleic acid; such a sequence is “substantially homologous.” The inhibition of hybridization of the completely complementary sequence to the target sequence may be examined using a hybridization assay (Southern or Northern blot, solution hybridization and the like) under conditions of low stringency. A substantially homologous sequence or probe will compete for and inhibit the binding (i.e., the hybridization) of a completely homologous nucleic acid molecule to a target under conditions of low stringency. This is not to say that conditions of low stringency are such that non-specific binding is permitted; low stringency conditions require that the binding of two sequences to one another be a specific (i.e., selective) interaction. The absence of non-specific binding may be tested by the use of a second target that is substantially non-complementary (e.g., less than about 30% identity); in the absence of non-specific binding the probe will not hybridize to the second non-complementary target.
As used herein, the term “hybridization” is used in reference to the pairing of complementary nucleic acids. Hybridization and the strength of hybridization (i.e., the strength of the association between the nucleic acids) are impacted by such factors as the degree of complementarity between the nucleic acids, stringency of the conditions involved, the Tm of the formed hybrid, and the G:C ratio within the nucleic acids. A single molecule that contains pairing of complementary nucleic acids within its structure is said to be “self-hybridized.”
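Two of the factors named above, G:C content and Tm, can be estimated from sequence alone. The sketch below uses the Wallace rule of thumb, Tm ≈ 2 °C × (A+T) + 4 °C × (G+C), which is a common approximation for short oligonucleotides and is not taken from this document. The function names are hypothetical, and real melting temperatures also depend on salt and probe concentration.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)

def wallace_tm(seq: str) -> int:
    """Rough melting temperature (deg C) of a short oligonucleotide duplex.

    Wallace rule: 2 degrees per A/T base plus 4 degrees per G/C base.
    This is an approximation brought in for illustration only.
    """
    seq = seq.upper()
    at = sum(base in "AT" for base in seq)
    gc = sum(base in "GC" for base in seq)
    return 2 * at + 4 * gc
```

Because each G:C pair contributes more than each A:T pair, two oligonucleotides of equal length can differ substantially in estimated Tm.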
As used herein, the term “stringency” is used in reference to the conditions of temperature, ionic strength, and the presence of other compounds such as organic solvents, under which nucleic acid hybridizations are conducted. Under “low stringency conditions” a nucleic acid sequence of interest will hybridize to its exact complement, sequences with single base mismatches, closely related sequences (e.g., sequences with 90% or greater homology), and sequences having only partial homology (e.g., sequences with 50-90% homology). Under “medium stringency conditions,” a nucleic acid sequence of interest will hybridize only to its exact complement, sequences with single base mismatches, and closely related sequences (e.g., 90% or greater homology). Under “high stringency conditions,” a nucleic acid sequence of interest will hybridize only to its exact complement, and (depending on conditions such as temperature) sequences with single base mismatches. In other words, under conditions of high stringency the temperature can be raised so as to exclude hybridization to sequences with single base mismatches.
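The three stringency tiers can be summarized as minimum homology thresholds. This is a deliberately simplified sketch: the names `MIN_HOMOLOGY` and `hybridizes` are hypothetical, the thresholds are taken from the example percentages in the definition, and high stringency is modeled as requiring an exact complement even though, depending on temperature, single base mismatches may still be tolerated.

```python
# Minimum percent homology expected to hybridize at each stringency level,
# using the example figures from the definition above.
MIN_HOMOLOGY = {
    "low": 50.0,      # partial homology (50-90%) and everything above
    "medium": 90.0,   # closely related sequences (90% or greater)
    "high": 100.0,    # exact complement only (simplifying assumption)
}

def hybridizes(percent_homology: float, stringency: str) -> bool:
    """Return True if a sequence of the given homology is expected to hybridize."""
    if not 0.0 <= percent_homology <= 100.0:
        raise ValueError("percent_homology must be between 0 and 100")
    if stringency not in MIN_HOMOLOGY:
        raise ValueError(f"unknown stringency: {stringency!r}")
    return percent_homology >= MIN_HOMOLOGY[stringency]
```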
The term “isolated” when used in relation to a nucleic acid, as in “an isolated oligonucleotide” or “isolated polynucleotide” refers to a nucleic acid sequence that is identified and separated from at least one component or contaminant with which it is ordinarily associated in its natural source. As such, isolated nucleic acid is present in a form or setting that is different from that in which it is found in nature. In contrast, non-isolated nucleic acids are nucleic acids such as DNA and RNA found in the state they exist in nature. For example, a given DNA sequence (e.g., a gene) is found on the host cell chromosome in proximity to neighboring genes; RNA sequences, such as a specific mRNA sequence encoding a specific protein, are found in the cell as a mixture with numerous other mRNAs that encode a multitude of proteins. However, isolated nucleic acid encoding a given protein includes, by way of example, such nucleic acid in cells ordinarily expressing the given protein where the nucleic acid is in a chromosomal location different from that of natural cells, or is otherwise flanked by a different nucleic acid sequence than that found in nature. The isolated nucleic acid, oligonucleotide, or polynucleotide may be present in single-stranded or double-stranded form. When an isolated nucleic acid, oligonucleotide or polynucleotide is to be utilized to express a protein, the oligonucleotide or polynucleotide will contain at a minimum the sense or coding strand (i.e., the oligonucleotide or polynucleotide may be single-stranded), but may contain both the sense and anti-sense strands (i.e., the oligonucleotide or polynucleotide may be double-stranded).
As used herein, the term “purified” or “to purify” refers to the removal of components (e.g., contaminants) from a sample. For example, antibodies are purified by removal of contaminating non-immunoglobulin proteins; they are also purified by the removal of immunoglobulin that does not bind to the target molecule. The removal of non-immunoglobulin proteins and/or the removal of immunoglobulins that do not bind to the target molecule results in an increase in the percent of target-reactive immunoglobulins in the sample. In another example, recombinant polypeptides are expressed in bacterial host cells and the polypeptides are purified by the removal of host cell proteins; the percent of recombinant polypeptides is thereby increased in the sample.
As used herein, the term “sample” is used in its broadest sense. In one sense, it is meant to include a specimen or culture obtained from any source, as well as biological and environmental samples. Biological samples may be obtained from animals (including humans) and encompass fluids, solids, tissues, and gases. Biological samples include blood products, such as plasma, serum and the like. Such examples are not however to be construed as limiting the sample types applicable to the present invention.
The present invention relates to compositions and methods for cancer diagnosis, research and therapy, including but not limited to, cancer markers. In particular, the present invention relates to pseudogenes as diagnostic markers and clinical targets for prostate cancer.
For example, in some embodiments, ATP8A2-Ψ and DPP3 pseudogenes were identified as being specific to breast cancer and CXADR-Ψ, NDUFA9, EPCAM, PDAP1, RBM17 and CES7 pseudogenes and the KLK4-KLKP1 fusion were identified as being specific to prostate cancer. Sequences of exemplary pseudogenes are shown in
As described above, embodiments of the present invention provide diagnostic and screening methods that utilize the detection of pseudogenes (e.g., ATP8A2-Ψ, DPP3, CXADR-Ψ, NDUFA9, EPCAM, PDAP1, RBM17, CES7 and KLK4-KLKP1). Exemplary, non-limiting methods are described below.
Any patient sample suspected of containing the pseudogenes may be tested according to methods of embodiments of the present invention. By way of non-limiting examples, the sample may be tissue (e.g., a prostate biopsy sample or a tissue sample obtained by prostatectomy), blood, urine, semen, prostatic secretions or a fraction thereof (e.g., plasma, serum, urine supernatant, urine cell pellet or prostate cells). A urine sample is preferably collected immediately following an attentive digital rectal examination (DRE), which causes prostate cells from the prostate gland to shed into the urinary tract.
In some embodiments, the patient sample is subjected to preliminary processing designed to isolate or enrich the sample for the pseudogenes or cells that contain the pseudogenes. A variety of techniques known to those of ordinary skill in the art may be used for this purpose, including but not limited to: centrifugation; immunocapture; cell lysis; and, nucleic acid target capture (See, e.g., EP Pat. No. 1 409 727, herein incorporated by reference in its entirety).
The pseudogenes may be detected along with other markers in a multiplex or panel format. Markers are selected for their predictive value alone or in combination with the pseudogenes. Exemplary prostate cancer markers include, but are not limited to: AMACR/P504S (U.S. Pat. No. 6,262,245); PCA3 (U.S. Pat. No. 7,008,765); PCGEM1 (U.S. Pat. No. 6,828,429); prostein/P501S, P503S, P504S, P509S, P510S, prostase/P703P, P710P (U.S. Publication No. 20030185830); RAS/KRAS (Bos, Cancer Res. 49:4682-89 (1989); Kranenburg, Biochimica et Biophysica Acta 1756:81-82 (2005)); and, those disclosed in U.S. Pat. Nos. 5,854,206; 6,034,218; and 7,229,774, each of which is herein incorporated by reference in its entirety. Markers for other cancers, diseases, infections, and metabolic conditions are also contemplated for inclusion in a multiplex or panel format.
i. DNA and RNA Detection
The pseudogenes of the present invention are detected using a variety of nucleic acid techniques known to those of ordinary skill in the art, including but not limited to: nucleic acid sequencing; nucleic acid hybridization; and, nucleic acid amplification.
1. Sequencing
Illustrative non-limiting examples of nucleic acid sequencing techniques include, but are not limited to, chain terminator (Sanger) sequencing and dye terminator sequencing. Those of ordinary skill in the art will recognize that, because RNA is less stable in the cell and more prone to nuclease attack experimentally, RNA is usually reverse transcribed to DNA before sequencing.
Chain terminator sequencing uses sequence-specific termination of a DNA synthesis reaction using modified nucleotide substrates. Extension is initiated at a specific site on the template DNA by using a short radioactive, or other labeled, oligonucleotide primer complementary to the template at that region. The oligonucleotide primer is extended using a DNA polymerase, standard four deoxynucleotide bases, and a low concentration of one chain terminating nucleotide, most commonly a di-deoxynucleotide. This reaction is repeated in four separate tubes with each of the bases taking turns as the di-deoxynucleotide. Limited incorporation of the chain terminating nucleotide by the DNA polymerase results in a series of related DNA fragments that are terminated only at positions where that particular di-deoxynucleotide is used. For each reaction tube, the fragments are size-separated by electrophoresis in a slab polyacrylamide gel or a capillary tube filled with a viscous polymer. The sequence is determined by reading which lane produces a visualized mark from the labeled primer as you scan from the top of the gel to the bottom.
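The four-reaction scheme described above can be mimicked in a few lines. This is a toy model under stated assumptions: the function names are hypothetical, fragment lengths are counted from the first base added to the primer (the primer's own length is ignored), and every possible termination position is assumed to be represented on the gel.

```python
def sanger_lanes(synthesized: str) -> dict:
    """Simulate the four chain-termination reactions.

    For each base, return the lengths of the fragments that terminate with the
    corresponding di-deoxynucleotide, i.e. the 1-based positions of that base
    in the newly synthesized strand.
    """
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(synthesized.upper(), start=1):
        lanes[base].append(position)
    return lanes

def read_gel(lanes: dict) -> str:
    """Reconstruct the synthesized strand by ordering bands from shortest to longest."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[length] for length in sorted(base_at_length))
```

Here each lane holds only fragments ending in one base, so reading which lane contributes the band at each successive fragment length recovers the sequence of the synthesized strand.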
Dye terminator sequencing alternatively labels the terminators. Complete sequencing can be performed in a single reaction by labeling each of the di-deoxynucleotide chain-terminators with a separate fluorescent dye, which fluoresces at a different wavelength.
A variety of nucleic acid sequencing methods are contemplated for use in the methods of the present disclosure including, for example, chain terminator (Sanger) sequencing, dye terminator sequencing, and high-throughput sequencing methods. Many of these sequencing methods are well known in the art. See, e.g., Sanger et al., Proc. Natl. Acad. Sci. USA 74:5463-5467 (1977); Maxam et al., Proc. Natl. Acad. Sci. USA 74:560-564 (1977); Drmanac, et al., Nat. Biotechnol. 16:54-58 (1998); Kato, Int. J. Clin. Exp. Med. 2:193-202 (2009); Ronaghi et al., Anal. Biochem. 242:84-89 (1996); Margulies et al., Nature 437:376-380 (2005); Ruparel et al., Proc. Natl. Acad. Sci. USA 102:5932-5937 (2005), and Harris et al., Science 320:106-109 (2008); Levene et al., Science 299:682-686 (2003); Korlach et al., Proc. Natl. Acad. Sci. USA 105:1176-1181 (2008); Branton et al., Nat. Biotechnol. 26(10):1146-53 (2008); Eid et al., Science 323:133-138 (2009); each of which is herein incorporated by reference in its entirety.
A number of DNA sequencing techniques are known in the art, including fluorescence-based sequencing methodologies (See, e.g., Birren et al., Genome Analysis: Analyzing DNA, 1, Cold Spring Harbor, N.Y.; herein incorporated by reference in its entirety). In some embodiments, automated sequencing techniques understood in the art are utilized. In some embodiments, parallel sequencing of partitioned amplicons (PCT Publication No: WO2006084132 to Kevin McKernan et al., herein incorporated by reference in its entirety) is utilized. In some embodiments, bridge amplification (see, e.g., WO 2000/018957, U.S. Pat. Nos. 7,972,820; 7,790,418 and Adessi et al., Nucleic Acids Research (2000): 28(20): E87; each of which is herein incorporated by reference) is utilized. In some embodiments, DNA sequencing by parallel oligonucleotide extension (See, e.g., U.S. Pat. No. 5,750,341 to Macevicz et al., and U.S. Pat. No. 6,306,597 to Macevicz et al., both of which are herein incorporated by reference in their entireties) is utilized. Additional examples of sequencing techniques include the Church polony technology (Mitra et al., 2003, Analytical Biochemistry 320, 55-65; Shendure et al., 2005 Science 309, 1728-1732; U.S. Pat. No. 6,432,360, U.S. Pat. No. 6,485,944, U.S. Pat. No. 6,511,803; herein incorporated by reference in their entireties), the 454 picotiter pyrosequencing technology (Margulies et al., 2005 Nature 437, 376-380; US 20050130173; herein incorporated by reference in their entireties), the Solexa single base addition technology (Bennett et al., 2005, Pharmacogenomics, 6, 373-382; U.S. Pat. No. 6,787,308; U.S. Pat. No. 6,833,246; herein incorporated by reference in their entireties), the Lynx massively parallel signature sequencing technology (Brenner et al. (2000). Nat. Biotechnol. 18:630-634; U.S. Pat. No. 5,695,934; U.S. Pat. No. 5,714,330; herein incorporated by reference in their entireties), and the Adessi PCR colony technology (Adessi et al. (2000). Nucleic Acid Res. 28, E87; WO 00018957; herein incorporated by reference in its entirety).
Next-generation sequencing (NGS) methods share the common feature of massively parallel, high-throughput strategies, with the goal of lower costs in comparison to older sequencing methods (see, e.g., Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; each herein incorporated by reference in their entirety). NGS methods can be broadly divided into those that typically use template amplification and those that do not. Amplification-requiring methods include pyrosequencing commercialized by Roche as the 454 technology platforms (e.g., GS 20 and GS FLX), the Solexa platform commercialized by Illumina, and the Supported Oligonucleotide Ligation and Detection (SOLiD) platform commercialized by Applied Biosystems. Non-amplification approaches, also known as single-molecule sequencing, are exemplified by the HeliScope platform commercialized by Helicos BioSciences, and emerging platforms commercialized by VisiGen, Oxford Nanopore Technologies Ltd., Life Technologies/Ion Torrent, and Pacific Biosciences, respectively.
2. Hybridization
Illustrative non-limiting examples of nucleic acid hybridization techniques include, but are not limited to, in situ hybridization (ISH), microarray, and Southern or Northern blot. In situ hybridization (ISH) is a type of hybridization that uses a labeled complementary DNA or RNA strand as a probe to localize a specific DNA or RNA sequence in a portion or section of tissue (in situ), or, if the tissue is small enough, the entire tissue (whole mount ISH). DNA ISH can be used to determine the structure of chromosomes. RNA ISH is used to measure and localize mRNAs and other transcripts (e.g., pseudogenes) within tissue sections or whole mounts. Sample cells and tissues are usually treated to fix the target transcripts in place and to increase access of the probe. The probe hybridizes to the target sequence at elevated temperature, and then the excess probe is washed away. The probe that was labeled with either radio-, fluorescent- or antigen-labeled bases is localized and quantitated in the tissue using either autoradiography, fluorescence microscopy or immunohistochemistry, respectively. ISH can also use two or more probes, labeled with radioactivity or the other non-radioactive labels, to simultaneously detect two or more transcripts.
In some embodiments, pseudogenes are detected using fluorescence in situ hybridization (FISH). In some embodiments, FISH assays utilize bacterial artificial chromosomes (BACs). These have been used extensively in the human genome sequencing project (see Nature 409: 953-958 (2001)) and clones containing specific BACs are available through distributors that can be located through many sources, e.g., NCBI. Each BAC clone from the human genome has been given a reference name that unambiguously identifies it. These names can be used to find a corresponding GenBank sequence and to order copies of the clone from a distributor.
The present invention further provides a method of performing a FISH assay on human prostate cells, human prostate tissue or on the fluid surrounding said human prostate cells or human prostate tissue. Specific protocols are well known in the art and can be readily adapted for the present invention. Guidance regarding methodology may be obtained from many references including: In situ Hybridization: Medical Applications (eds. G. R. Coulton and J. de Belleroche), Kluwer Academic Publishers, Boston (1992); In situ Hybridization: In Neurobiology; Advances in Methodology (eds. J. H. Eberwine, K. L. Valentino, and J. D. Barchas), Oxford University Press Inc., England (1994); In situ Hybridization: A Practical Approach (ed. D. G. Wilkinson), Oxford University Press Inc., England (1992); Kuo, et al., Am. J. Hum. Genet. 49:112-119 (1991); Klinger, et al., Am. J. Hum. Genet. 51:55-65 (1992); and Ward, et al., Am. J. Hum. Genet. 52:854-865 (1993). There are also kits that are commercially available and that provide protocols for performing FISH assays (available from e.g., Oncor, Inc., Gaithersburg, Md.). Patents providing guidance on methodology include U.S. Pat. Nos. 5,225,326; 5,545,524; 6,121,489 and 6,573,043. All of these references are hereby incorporated by reference in their entirety and may be used along with similar references in the art and with the information provided in the Examples section herein to establish procedural steps convenient for a particular laboratory.
3. Microarrays
Different kinds of biological assays are called microarrays including, but not limited to: DNA microarrays (e.g., cDNA microarrays and oligonucleotide microarrays); protein microarrays; tissue microarrays; transfection or cell microarrays; chemical compound microarrays; and antibody microarrays. A DNA microarray, commonly known as a gene chip, DNA chip, or biochip, is a collection of microscopic DNA spots attached to a solid surface (e.g., glass, plastic or silicon chip) forming an array for the purpose of expression profiling or monitoring expression levels for thousands of genes simultaneously. The affixed DNA segments are known as probes, thousands of which can be used in a single DNA microarray. Microarrays can be used to identify disease genes or transcripts (e.g., pseudogenes) by comparing gene expression in disease and normal cells. Microarrays can be fabricated using a variety of technologies, including but not limited to: printing with fine-pointed pins onto glass slides; photolithography using pre-made masks; photolithography using dynamic micromirror devices; ink jet printing; or electrochemistry on microelectrode arrays.
Southern and Northern blotting are used to detect specific DNA or RNA sequences, respectively. DNA or RNA extracted from a sample is fragmented, electrophoretically separated on a matrix gel, and transferred to a membrane filter. The filter-bound DNA or RNA is subjected to hybridization with a labeled probe complementary to the sequence of interest. Hybridized probe bound to the filter is detected. A variant of the procedure is the reverse Northern blot, in which the substrate nucleic acid that is affixed to the membrane is a collection of isolated DNA fragments and the probe is RNA extracted from a tissue and labeled.
3. Amplification
Nucleic acids (e.g., pseudogenes) may be amplified prior to or simultaneous with detection. Illustrative non-limiting examples of nucleic acid amplification techniques include polymerase chain reaction (PCR), reverse transcription polymerase chain reaction (RT-PCR), transcription-mediated amplification (TMA), ligase chain reaction (LCR), strand displacement amplification (SDA), and nucleic acid sequence based amplification (NASBA). Those of ordinary skill in the art will recognize that certain amplification techniques (e.g., PCR) require that RNA be reverse transcribed to DNA prior to amplification (e.g., RT-PCR), whereas other amplification techniques directly amplify RNA (e.g., TMA and NASBA).
The polymerase chain reaction (U.S. Pat. Nos. 4,683,195, 4,683,202, 4,800,159 and 4,965,188, each of which is herein incorporated by reference in its entirety), commonly referred to as PCR, uses multiple cycles of denaturation, annealing of primer pairs to opposite strands, and primer extension to exponentially increase copy numbers of a target nucleic acid sequence. In a variation called RT-PCR, reverse transcriptase (RT) is used to make a complementary DNA (cDNA) from mRNA, and the cDNA is then amplified by PCR to produce multiple copies of DNA. For other various permutations of PCR see, e.g., U.S. Pat. Nos. 4,683,195, 4,683,202 and 4,800,159; Mullis et al., Meth. Enzymol. 155: 335 (1987); and, Murakawa et al., DNA 7: 287 (1988), each of which is herein incorporated by reference in its entirety.
Transcription mediated amplification (U.S. Pat. Nos. 5,480,784 and 5,399,491, each of which is herein incorporated by reference in its entirety), commonly referred to as TMA, synthesizes multiple copies of a target nucleic acid sequence autocatalytically under conditions of substantially constant temperature, ionic strength, and pH in which multiple RNA copies of the target sequence autocatalytically generate additional copies. See, e.g., U.S. Pat. Nos. 5,399,491 and 5,824,518, each of which is herein incorporated by reference in its entirety. In a variation described in U.S. Publ. No. 20060046265 (herein incorporated by reference in its entirety), TMA optionally incorporates the use of blocking moieties, terminating moieties, and other modifying moieties to improve TMA process sensitivity and accuracy.
The ligase chain reaction (Weiss, R., Science 254: 1292 (1991), herein incorporated by reference in its entirety), commonly referred to as LCR, uses two sets of complementary DNA oligonucleotides that hybridize to adjacent regions of the target nucleic acid. The DNA oligonucleotides are covalently linked by a DNA ligase in repeated cycles of thermal denaturation, hybridization and ligation to produce a detectable double-stranded ligated oligonucleotide product. Strand displacement amplification (Walker, G. et al., Proc. Natl. Acad. Sci. USA 89: 392-396 (1992); U.S. Pat. Nos. 5,270,184 and 5,455,166, each of which is herein incorporated by reference in its entirety), commonly referred to as SDA, uses cycles of annealing pairs of primer sequences to opposite strands of a target sequence, primer extension in the presence of a dNTPαS to produce a duplex hemiphosphorothioated primer extension product, endonuclease-mediated nicking of a hemimodified restriction endonuclease recognition site, and polymerase-mediated primer extension from the 3′ end of the nick to displace an existing strand and produce a strand for the next round of primer annealing, nicking and strand displacement, resulting in geometric amplification of product. Thermophilic SDA (tSDA) uses thermophilic endonucleases and polymerases at higher temperatures in essentially the same method (EP Pat. No. 0 684 315).
Other amplification methods include, for example: nucleic acid sequence based amplification (U.S. Pat. No. 5,130,238, herein incorporated by reference in its entirety), commonly referred to as NASBA; one that uses an RNA replicase to amplify the probe molecule itself (Lizardi et al., BioTechnol. 6: 1197 (1988), herein incorporated by reference in its entirety), commonly referred to as Qβ replicase; a transcription based amplification method (Kwoh et al., Proc. Natl. Acad. Sci. USA 86:1173 (1989)); and, self-sustained sequence replication (Guatelli et al., Proc. Natl. Acad. Sci. USA 87: 1874 (1990), each of which is herein incorporated by reference in its entirety). For further discussion of known amplification methods see Persing, David H., “In Vitro Nucleic Acid Amplification Techniques” in Diagnostic Medical Microbiology: Principles and Applications (Persing et al., Eds.), pp. 51-87 (American Society for Microbiology, Washington, D.C. (1993)).
4. Detection Methods
Non-amplified or amplified nucleic acids can be detected by any conventional means. For example, the pseudogenes can be detected by hybridization with a detectably labeled probe and measurement of the resulting hybrids. Illustrative non-limiting examples of detection methods are described below.
One illustrative detection method, the Hybridization Protection Assay (HPA), involves hybridizing a chemiluminescent oligonucleotide probe (e.g., an acridinium ester-labeled (AE) probe) to the target sequence, selectively hydrolyzing the chemiluminescent label present on unhybridized probe, and measuring the chemiluminescence produced from the remaining probe in a luminometer. See, e.g., U.S. Pat. No. 5,283,174 and Norman C. Nelson et al., Nonisotopic Probing, Blotting, and Sequencing, ch. 17 (Larry J. Kricka ed., 2d ed. 1995), each of which is herein incorporated by reference in its entirety.
Another illustrative detection method provides for quantitative evaluation of the amplification process in real-time. Evaluation of an amplification process in “real-time” involves determining the amount of amplicon in the reaction mixture either continuously or periodically during the amplification reaction, and using the determined values to calculate the amount of target sequence initially present in the sample. A variety of methods for determining the amount of initial target sequence present in a sample based on real-time amplification are well known in the art. These include methods disclosed in U.S. Pat. Nos. 6,303,305 and 6,541,205, each of which is herein incorporated by reference in its entirety. Another method for determining the quantity of target sequence initially present in a sample, but which is not based on a real-time amplification, is disclosed in U.S. Pat. No. 5,710,029, herein incorporated by reference in its entirety.
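By way of illustration only, the following is a minimal sketch of one common way to back-calculate the initial target amount from real-time measurements. It assumes a standard-curve approach, which is not necessarily the method of the patents cited above: the threshold cycle (Ct) of each calibration standard is regressed against the log10 of its known starting quantity, and the fitted line is inverted for an unknown sample. The function names and the calibration values are hypothetical.

```python
import math

def fit_standard_curve(quantities, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def initial_quantity(ct, slope, intercept):
    """Invert the fitted line to estimate the starting quantity of a sample."""
    return 10 ** ((ct - intercept) / slope)

def amplification_efficiency(slope):
    """Per-cycle efficiency implied by the slope (1.0 means perfect doubling)."""
    return 10 ** (-1.0 / slope) - 1.0

# Hypothetical calibration standards (copies per reaction) and their Ct values.
standards = [1e2, 1e3, 1e4, 1e5]
cts = [30.0, 26.68, 23.36, 20.04]

slope, intercept = fit_standard_curve(standards, cts)
estimate = initial_quantity(25.0, slope, intercept)
print(f"slope={slope:.2f}, efficiency={amplification_efficiency(slope):.2f}, "
      f"estimated copies at Ct 25.0 = {estimate:.0f}")
```

A slope near -3.32 cycles per tenfold dilution corresponds to approximately one doubling per cycle; the sketch reports the implied efficiency so that a poorly performing assay can be recognized before its estimates are used.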
Amplification products may be detected in real-time through the use of various self-hybridizing probes, most of which have a stem-loop structure. Such self-hybridizing probes are labeled so that they emit differently detectable signals, depending on whether the probes are in a self-hybridized state or an altered state through hybridization to a target sequence. By way of non-limiting example, “molecular torches” are a type of self-hybridizing probe that includes distinct regions of self-complementarity (referred to as “the target binding domain” and “the target closing domain”) which are connected by a joining region (e.g., non-nucleotide linker) and which hybridize to each other under predetermined hybridization assay conditions. In a preferred embodiment, molecular torches contain single-stranded base regions in the target binding domain that are from 1 to about 20 bases in length and are accessible for hybridization to a target sequence present in an amplification reaction under strand displacement conditions. Under strand displacement conditions, hybridization of the two complementary regions, which may be fully or partially complementary, of the molecular torch is favored, except in the presence of the target sequence, which will bind to the single-stranded region present in the target binding domain and displace all or a portion of the target closing domain. The target binding domain and the target closing domain of a molecular torch include a detectable label or a pair of interacting labels (e.g., luminescent/quencher) positioned so that a different signal is produced when the molecular torch is self-hybridized than when the molecular torch is hybridized to the target sequence, thereby permitting detection of probe:target duplexes in a test sample in the presence of unhybridized molecular torches. Molecular torches and a variety of types of interacting label pairs are disclosed in U.S. Pat. No. 6,534,274, herein incorporated by reference in its entirety.
Another example of a detection probe having self-complementarity is a “molecular beacon.” Molecular beacons include nucleic acid molecules having a target complementary sequence, an affinity pair (or nucleic acid arms) holding the probe in a closed conformation in the absence of a target sequence present in an amplification reaction, and a label pair that interacts when the probe is in a closed conformation. Hybridization of the target sequence and the target complementary sequence separates the members of the affinity pair, thereby shifting the probe to an open conformation. The shift to the open conformation is detectable due to reduced interaction of the label pair, which may be, for example, a fluorophore and a quencher (e.g., DABCYL and EDANS). Molecular beacons are disclosed in U.S. Pat. Nos. 5,925,517 and 6,150,097, herein incorporated by reference in its entirety.
Other self-hybridizing probes are well known to those of ordinary skill in the art. By way of non-limiting example, probe binding pairs having interacting labels, such as those disclosed in U.S. Pat. No. 5,928,862 (herein incorporated by reference in its entirety) might be adapted for use in the present invention. Probe systems used to detect single nucleotide polymorphisms (SNPs) might also be utilized in the present invention. Additional detection systems include “molecular switches,” as disclosed in U.S. Publ. No. 20050042638, herein incorporated by reference in its entirety. Other probes, such as those comprising intercalating dyes and/or fluorochromes, are also useful for detection of amplification products in the present invention. See, e.g., U.S. Pat. No. 5,814,447 (herein incorporated by reference in its entirety).
ii. Data Analysis
In some embodiments, a computer-based analysis program is used to translate the raw data generated by the detection assay (e.g., the presence, absence, or amount of a given marker or markers) into data of predictive value for a clinician. The clinician can access the predictive data using any suitable means. Thus, in some preferred embodiments, the present invention provides the further benefit that the clinician, who is not likely to be trained in genetics or molecular biology, need not understand the raw data. The data is presented directly to the clinician in its most useful form. The clinician is then able to immediately utilize the information in order to optimize the care of the subject.
The present invention contemplates any method capable of receiving, processing, and transmitting the information to and from laboratories conducting the assays, information providers, medical personnel, and subjects. For example, in some embodiments of the present invention, a sample (e.g., a biopsy or a serum or urine sample) is obtained from a subject and submitted to a profiling service (e.g., clinical lab at a medical facility, genomic profiling business, etc.), located in any part of the world (e.g., in a country different than the country where the subject resides or where the information is ultimately used) to generate raw data. Where the sample comprises a tissue or other biological sample, the subject may visit a medical center to have the sample obtained and sent to the profiling center, or subjects may collect the sample themselves (e.g., a urine sample) and directly send it to a profiling center. Where the sample comprises previously determined biological information, the information may be directly sent to the profiling service by the subject (e.g., an information card containing the information may be scanned by a computer and the data transmitted to a computer of the profiling center using an electronic communication system). Once received by the profiling service, the sample is processed and a profile is produced (i.e., expression data), specific for the diagnostic or prognostic information desired for the subject.
The profile data is then prepared in a format suitable for interpretation by a treating clinician. For example, rather than providing raw expression data, the prepared format may represent a diagnosis or risk assessment (e.g., presence or absence of a pseudogene) for the subject, along with recommendations for particular treatment options. The data may be displayed to the clinician by any suitable method. For example, in some embodiments, the profiling service generates a report that can be printed for the clinician (e.g., at the point of care) or displayed to the clinician on a computer monitor.
In some embodiments, the information is first analyzed at the point of care or at a regional facility. The raw data is then sent to a central processing facility for further analysis and/or to convert the raw data to information useful for a clinician or patient. The central processing facility provides the advantage of privacy (all data is stored in a central facility with uniform security protocols), speed, and uniformity of data analysis. The central processing facility can then control the fate of the data following treatment of the subject. For example, using an electronic communication system, the central facility can provide data to the clinician, the subject, or researchers.
In some embodiments, the subject is able to directly access the data using the electronic communication system. The subject may choose further intervention or counseling based on the results. In some embodiments, the data is used for research. For example, the data may be used to further optimize the inclusion or elimination of markers as useful indicators of a particular condition or stage of disease or as a companion diagnostic to determine a treatment course of action.
iii. In Vivo Imaging
Pseudogenes (e.g., ATP8A2-Ψ, DPP3, CXADR-Ψ, NDUFA9, EPCAM, PDAP1, RBM17, CES7 and KLK4-KLKP1) may also be detected using in vivo imaging techniques, including but not limited to: radionuclide imaging; positron emission tomography (PET); computerized axial tomography, X-ray or magnetic resonance imaging method, fluorescence detection, and chemiluminescent detection. In some embodiments, in vivo imaging techniques are used to visualize the presence of or expression of cancer markers in an animal (e.g., a human or non-human mammal). For example, in some embodiments, cancer marker mRNA or protein is labeled using a labeled antibody specific for the cancer marker. A specifically bound and labeled antibody can be detected in an individual using an in vivo imaging method, including, but not limited to, radionuclide imaging, positron emission tomography, computerized axial tomography, X-ray or magnetic resonance imaging method, fluorescence detection, and chemiluminescent detection. Methods for generating antibodies to the cancer markers of the present invention are described below.
The in vivo imaging methods of embodiments of the present invention are useful in the identification of cancers that express pseudogenes (e.g., prostate cancer). In vivo imaging is used to visualize the presence or level of expression of a pseudogene. Such techniques allow for diagnosis without the use of an unpleasant biopsy. The in vivo imaging methods of embodiments of the present invention can further be used to detect metastatic cancers in other parts of the body.
In some embodiments, reagents (e.g., antibodies) specific for the cancer markers of the present invention are fluorescently labeled. The labeled antibodies are introduced into a subject (e.g., orally or parenterally). Fluorescently labeled antibodies are detected using any suitable method (e.g., using the apparatus described in U.S. Pat. No. 6,198,107, herein incorporated by reference).
In other embodiments, antibodies are radioactively labeled. The use of antibodies for in vivo diagnosis is well known in the art. Sumerdon et al. (Nucl. Med. Biol 17:247-254 [1990]) have described an optimized antibody-chelator for the radioimmunoscintigraphic imaging of tumors using Indium-111 as the label. Griffin et al. (J Clin Onc 9:631-640 [1991]) have described the use of this agent in detecting tumors in patients suspected of having recurrent colorectal cancer. The use of similar agents with paramagnetic ions as labels for magnetic resonance imaging is known in the art (Lauffer, Magnetic Resonance in Medicine 22:339-342 [1991]). The label used will depend on the imaging modality chosen. Radioactive labels such as Indium-111, Technetium-99m, or Iodine-131 can be used for planar scans or single photon emission computed tomography (SPECT). Positron emitting labels such as Fluorine-18 can also be used for positron emission tomography (PET). For MRI, paramagnetic ions such as Gadolinium (III) or Manganese (II) can be used.
Radioactive metals with half-lives ranging from 1 hour to 3.5 days are available for conjugation to antibodies, such as scandium-47 (3.5 days), gallium-67 (2.8 days), gallium-68 (68 minutes), technetium-99m (6 hours), and indium-111 (3.2 days), of which gallium-67, technetium-99m, and indium-111 are preferable for gamma camera imaging, and gallium-68 is preferable for positron emission tomography.
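As an illustrative sketch only, the practical effect of these half-lives on the time available between labeling and imaging can be estimated with the standard exponential decay relation, in which the fraction of label remaining after an elapsed time t is 0.5 raised to the power t divided by the half-life. The half-life values below are those recited above, converted to hours; the dictionary and function names are hypothetical.

```python
# Half-lives recited in the text above, converted to hours.
HALF_LIFE_HOURS = {
    "scandium-47": 3.5 * 24,
    "gallium-67": 2.8 * 24,
    "gallium-68": 68 / 60,
    "technetium-99m": 6.0,
    "indium-111": 3.2 * 24,
}

def fraction_remaining(isotope, elapsed_hours):
    """Fraction of the original activity left after elapsed_hours."""
    return 0.5 ** (elapsed_hours / HALF_LIFE_HOURS[isotope])

# Compare how much label survives a 24-hour delay before imaging.
for isotope in HALF_LIFE_HOURS:
    print(f"{isotope}: {fraction_remaining(isotope, 24.0):.3g} remaining after 24 h")
```

Under this relation, a short-lived label such as gallium-68 is essentially exhausted within a day, whereas indium-111 retains most of its activity over the same period.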
A useful method of labeling antibodies with such radiometals is by means of a bifunctional chelating agent, such as diethylenetriaminepentaacetic acid (DTPA), as described, for example, by Khaw et al. (Science 209:295 [1980]) for In-111 and Tc-99m, and by Scheinberg et al. (Science 215:1511 [1982]). Other chelating agents may also be used, but the 1-(p-carboxymethoxybenzyl) EDTA and the carboxycarbonic anhydride of DTPA are advantageous because their use permits conjugation without affecting the antibody's immunoreactivity substantially.
Another method for coupling DTPA to proteins is by use of the cyclic anhydride of DTPA, as described by Hnatowich et al. (Int. J. Appl. Radiat. Isot. 33:327 [1982]) for labeling of albumin with In-111, but which can be adapted for labeling of antibodies. A suitable method of labeling antibodies with Tc-99m which does not use chelation with DTPA is the pretinning method of Crockford et al. (U.S. Pat. No. 4,323,546, herein incorporated by reference).
A method of labeling immunoglobulins with Tc-99m is that described by Wong et al. (Int. J. Appl. Radiat. Isot., 29:251 [1978]) for plasma protein, and recently applied successfully by Wong et al. (J. Nucl. Med., 23:229 [1981]) for labeling antibodies.
In the case of the radiometals conjugated to the specific antibody, it is likewise desirable to introduce as high a proportion of the radiolabel as possible into the antibody molecule without destroying its immunospecificity. A further improvement may be achieved by effecting radiolabeling in the presence of the pseudogene, to ensure that the antigen binding site on the antibody will be protected. The antigen is separated after labeling.
In still further embodiments, in vivo biophotonic imaging (Xenogen, Alameda, Calif.) is utilized for in vivo imaging. This real-time in vivo imaging utilizes luciferase. The luciferase gene is incorporated into cells, microorganisms, and animals (e.g., as a fusion protein with a cancer marker of the present invention). When active, it leads to a reaction that emits light. A CCD camera and software are used to capture the image and analyze it.
iv. Compositions & Kits
Compositions for use in the diagnostic methods described herein include, but are not limited to, probes, amplification oligonucleotides, and the like. In some embodiments, kits include all components necessary, sufficient or useful for detecting the markers described herein (e.g., reagents, controls, instructions, etc.). The kits described herein find use in research, therapeutic, screening, and clinical applications.
The probe and antibody compositions of the present invention may also be provided in the form of an array.
In some embodiments, the present invention provides drug screening assays (e.g., to screen for anticancer drugs). The screening methods of the present invention utilize pseudogenes. For example, in some embodiments, the present invention provides methods of screening for compounds that alter (e.g., decrease) the expression or activity of pseudogenes. The compounds or agents may interfere with transcription, by interacting, for example, with the promoter region. The compounds or agents may interfere with mRNA (e.g., by RNA interference, antisense technologies, etc.). The compounds or agents may interfere with pathways that are upstream or downstream of the biological activity of pseudogenes. In some embodiments, candidate compounds are antisense or interfering RNA agents (e.g., oligonucleotides) directed against pseudogenes. In other embodiments, candidate compounds are antibodies or small molecules that specifically bind to a pseudogene regulator or expression product and inhibit its biological function.
In one screening method, candidate compounds are evaluated for their ability to alter pseudogene expression by contacting a compound with a cell expressing a pseudogene and then assaying for the effect of the candidate compounds on expression. In some embodiments, the effect of candidate compounds on expression of pseudogenes is assayed by detecting the level of pseudogene expressed by the cell. mRNA expression can be detected by any suitable method.
The following examples are provided in order to demonstrate and further illustrate certain preferred embodiments and aspects of the present invention and are not to be construed as limiting the scope thereof.
Paired end transcriptome sequence reads (2×40 and 2×80 base pairs) were obtained from 13 tissue types including breast, prostate, pancreas, gastric, melanoma, and other tissues comprising a total of over 293 individual samples (
Paired end transcriptome reads were mapped to the human genome (NCBI36/hg18) and University of California Santa Cruz (UCSC) Genes using Efficient Alignment of Nucleotide Databases (ELAND) software of the Illumina Genome Analyzer Pipeline, using 32 bp seed length and allowing up to 2 mismatches; detailed mapping status is presented in Table 2. Table 2 shows primary mapping status of individual sequencing lanes: flowcell and lane ID (Column A), total number of reads (Column B), purity filter reads (Column C), followed by mapping count for each chromosome (hs_ref_chr1-22, X, and Y) including mitochondrial (chrM) and ribosomal sequences (humRibosomal) (Columns D-AC). Passed purity filter reads obtained from Illumina export and extended output files (as described before) were parsed and binned into three major categories: (1) both of the paired reads map to annotated genes; (2) one or both of the paired reads map to un-annotated regions in the genome; and (3) neither of the reads maps (these include viral, bacterial, and other contaminant reads, as well as sequencing errors). The paired reads with one or both partners mapping to an un-annotated region were clustered based on overlaps of aligned sequences using the chromosomal coordinates of the clusters. Singleton reads that did not cluster and stacked/duplicated reads with the same start and stop genomic coordinates (potential PCR artifacts) were filtered out. Passed filter ‘clusters’ were defined as units of transcript expression (analogous to a ‘probe’ on microarray platforms). These ‘clusters’ were screened against publicly available human pseudogene resources, Yale human pseudogene Build 53-processed, duplicate and fragment entries (Karro et al., 2007 Nucleic acids research 35, D55-60) and Gencode Manual Gene Annotations (level 1+2), Automated Gene Annotations (level 3) (October 2009) (Zheng et al., 2007 Genome research 17, 839-851) to identify and annotate pseudogene ‘clusters’.
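The clustering and filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: reads are represented as (chromosome, start, end) records with inclusive coordinates, stacked/duplicated reads are collapsed to a single copy before clustering (the text does not say whether they were collapsed or discarded outright), and two reads are treated as overlapping when the start of one does not exceed the current cluster end. The names Read, Cluster and cluster_reads are hypothetical.

```python
from collections import namedtuple

Read = namedtuple("Read", "chrom start end")
Cluster = namedtuple("Cluster", "chrom start end n_reads")

def cluster_reads(reads):
    """Merge overlapping aligned reads into clusters.

    Reads with identical chromosome, start and end (potential PCR
    artifacts) are collapsed to one copy (an assumption), and reads
    that overlap no other read (singletons) are dropped.
    """
    unique = sorted(set(reads))  # collapse duplicates, sort by chrom/start/end
    clusters = []
    current = None  # [chrom, start, end, n_reads] of the cluster being built
    for read in unique:
        if current and read.chrom == current[0] and read.start <= current[2]:
            # Read overlaps the open cluster: extend it.
            current[2] = max(current[2], read.end)
            current[3] += 1
        else:
            # Close the previous cluster, keeping it only if not a singleton.
            if current and current[3] > 1:
                clusters.append(Cluster(*current))
            current = [read.chrom, read.start, read.end, 1]
    if current and current[3] > 1:
        clusters.append(Cluster(*current))
    return clusters

# Hypothetical alignments: one duplicated read and one singleton on chr1.
example_reads = [
    Read("chr1", 100, 140), Read("chr1", 120, 160), Read("chr1", 120, 160),
    Read("chr1", 500, 540),
    Read("chr2", 100, 140), Read("chr2", 130, 170),
]
example_clusters = cluster_reads(example_reads)
for cluster in example_clusters:
    print(cluster)
```

Each resulting cluster carries its genomic start and end coordinates, which is the information needed to compare it against annotated pseudogene coordinates in a later step.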
The clusters were also subjected to homology search using the alignment tool BLAT (Kent, 2002 Genome research 12, 656-664) for an independent annotation. Sequence reads from individual samples were queried against the resultant clusters defined by the union of Yale, ENCODE and BLAT output to assess the expression of pseudogenes (
Data from individual samples was used to construct a matrix to carry out pseudogene expression profiling. A pseudogene transcript (one or more probe(s) overlapping with either Yale or ENCODE) detected in 2 or more samples of one tissue type and absent in all other tissue types was defined as tissue specific. If a pseudogene probe was detected in 10 out of 13 samples, it was designated as ubiquitous. All other cases were assigned to an intermediate category. Pseudogene transcripts detected in 3 or more cancer samples and absent in all benign samples were designated as cancer specific.
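The classification rules above can be sketched as follows. This is an illustration under stated assumptions rather than the code actually used: each pseudogene is described by the list of samples in which it was detected, given as (tissue type, is-cancer) pairs, and the "10 out of 13" criterion for the ubiquitous category is interpreted here as detection in at least 10 of the 13 tissue types, which is an assumption about the intended meaning. The function and parameter names are hypothetical.

```python
from collections import Counter

def expression_pattern(detections, min_samples=2, ubiquitous_min_tissues=10):
    """Classify a pseudogene as tissue specific, ubiquitous or intermediate.

    detections: list of (tissue_type, is_cancer) pairs, one per sample in
    which the pseudogene transcript was detected.
    """
    per_tissue = Counter(tissue for tissue, _ in detections)
    # Detected in a single tissue type, in at least min_samples samples.
    if len(per_tissue) == 1 and next(iter(per_tissue.values())) >= min_samples:
        return "tissue specific"
    # Assumption: "10 out of 13" refers to tissue types, read as a minimum.
    if len(per_tissue) >= ubiquitous_min_tissues:
        return "ubiquitous"
    return "intermediate"

def is_cancer_specific(detections, min_cancer_samples=3):
    """True if detected in enough cancer samples and in no benign sample."""
    n_cancer = sum(1 for _, is_cancer in detections if is_cancer)
    n_benign = sum(1 for _, is_cancer in detections if not is_cancer)
    return n_cancer >= min_cancer_samples and n_benign == 0

# Hypothetical pseudogene detected in three prostate cancer samples only.
example = [("prostate", True), ("prostate", True), ("prostate", True)]
print(expression_pattern(example), is_cancer_specific(example))
```

The two functions are kept separate because the tissue pattern and the cancer-specific designation are independent labels: a transcript may be both tissue specific and cancer specific.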
RNA Isolation and cDNA Synthesis
Total RNA was isolated using Trizol and an RNeasy Kit (Invitrogen) with DNase I digestion according to the manufacturer's instructions. RNA integrity was verified on an Agilent Bioanalyzer 2100 (Agilent Technologies, Palo Alto, Calif.). cDNA was synthesized from total RNA using Superscript III (Invitrogen) and random primers (Invitrogen).
Quantitative Real-time PCR (qPCR) was performed using Taqman or SYBR green based assays (Applied Biosystems, Foster City, Calif.) on an Applied Biosystems 7900HT Real-Time PCR System, according to standard protocols. The Taqman assays for CXADR and ATP8A2 were custom designed based on regions of difference between the wild type and pseudogene sequences. Oligonucleotide primers for SYBR green assays were obtained from Integrated DNA Technologies (Coralville, Iowa). The housekeeping gene, GAPDH, was used as a loading control. Fold changes were calculated relative to GAPDH and normalized to the median value of the benign samples.
Additionally, inventoried Taqman assays for CXADR-WT (Hs00154661_m1) and ATP8A2-WT (assay ID hs00185259_m1) were used.
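The fold-change calculation described above can be sketched as follows. This is a minimal illustration assuming the comparative Ct approach with approximately 100% amplification efficiency, so that expression relative to GAPDH is 2 raised to the negative difference between the target and GAPDH threshold cycles; the text does not state the exact formula used. The function names and Ct values are hypothetical.

```python
from statistics import median

def relative_expression(ct_target, ct_gapdh):
    """Target expression relative to GAPDH, assuming perfect doubling per cycle."""
    return 2.0 ** -(ct_target - ct_gapdh)

def fold_changes(samples):
    """Normalize GAPDH-relative expression to the median of the benign samples.

    samples: dict mapping sample name -> (ct_target, ct_gapdh, is_benign).
    """
    rel = {name: relative_expression(ct_t, ct_g)
           for name, (ct_t, ct_g, _) in samples.items()}
    benign_median = median(rel[name]
                           for name, (_, _, is_benign) in samples.items()
                           if is_benign)
    return {name: value / benign_median for name, value in rel.items()}

# Hypothetical Ct values for three benign samples and one tumor sample.
example_samples = {
    "benign_1": (30.0, 18.0, True),
    "benign_2": (31.0, 18.0, True),
    "benign_3": (29.0, 18.0, True),
    "tumor_1": (26.0, 18.0, False),
}
example_fold = fold_changes(example_samples)
for name, value in example_fold.items():
    print(name, value)
```

Normalizing to the benign median rather than the mean keeps a single outlying benign sample from shifting the baseline against which tumor samples are compared.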
Development of a Bioinformatics Platform for the Analysis of Pseudogene Transcription
High throughput transcriptome sequencing always yields a fraction of reads that do not map perfectly to the reference genome/transcriptome. This sequence fraction has been actively mined for aberrant transcripts including mutations (Shah et al., 2009a N Engl J Med 360, 2719-2729; Shah et al., 2009b Nature 461, 809-813; Tuch et al. PloS one 5, e9317), chimeric RNAs (Maher et al., 2009a Nature 458, 97-101; Maher et al., 2009b Proceedings of the National Academy of Sciences of the United States of America 106, 12353-12358), xenobiotic sequences, and non-coding RNAs (Gupta et al. Nature 464, 1071-1076; Loewer et al. Nature genetics 42, 1113-1117; Wilhelm et al., 2008 Nature 453, 1239-1243). In this study, the sequencing reads that showed a high degree of homology but an imperfect match to the reference genes, and that mapped perfectly elsewhere in the genome, served as the primary data for pseudogene expression analysis (
Overall, 2156 unique clusters were defined in terms of their genomic coordinates (start and end points) that were compared against annotated pseudogenes in the ENCODE and the Yale Gerstein Group (referred to as Yale) (Karro et al., 2007 Nucleic acids research 35, D55-60) databases, the two most comprehensive pseudogene annotation resources. Of the 2156 initial clusters, 934 overlapped with both the Yale and ENCODE databases, whereas 81 were found only in the Yale database and 15 only in the ENCODE database, overall accounting for 1506 distinct pseudogene transcripts corresponding to 1000 unique genes (
Some clusters were seen to be localized in the vicinity of known pseudogenes. 92 clusters that resided adjacent (within 5 kb) to previously annotated pseudogenes (
Further, taking into account the instances of multiple clusters representing the same pseudogene transcript, the 2156 transcript clusters overall amounted to transcriptional evidence of 2082 distinct pseudogenes, of which 1506 transcripts correspond to specific genomic coordinates in Yale and/or ENCODE pseudogenes, and as many as 576 potentially novel transcripts (described below) (
A closer look at the coverage of pseudogene clusters across the sample-wise compendium reveals that pseudogenes of housekeeping genes such as ribosomal and proteins are widely expressed across tissue types. Pseudogene transcripts corresponding to ribosomal proteins RPL-1, -3, -5, -6, -8, -9, -10, -11, -13, -18, -22, -23, -27, -28 and RPS-5, -6, -10, -11, -14, -16, -18, -20, -21 etc. were all observed in more than 50 samples each. Apart from housekeeping genes, several of the pseudogenes seen to be widely expressed included CALM2 (calmodulin 2 phosphorylase kinase, delta), TOMM40 (translocase of outer mitochondrial membrane 40), NONO (non-POU domain containing, octamer-binding), DUSP8 (dual specificity phosphatase 8), PERP (TP53 apoptosis effector), YES (v-yes-1 Yamaguchi sarcoma viral oncogene homolog 1) and others, were all found to be expressed in more than 50 of the samples examined by RNA-seq, as well as independently verified by pseudogene specific RT-PCR followed by validation through Sanger sequencing (Table 6A and 7,
Further, because the RNA-Seq compendium is comprised of 35-45 mer short sequence reads that largely generated short sequence clusters not optimal for pseudogene analysis tools such as Pseudopipe (Zhang et al., 2006, supra) and Pseudofam (Lam et al., 2009, supra) used in generating ENCODE and Yale databases, we also carried out a direct query of individual clusters against the human genome (hg18) using the BLAT tool from UCSC, that is ideally suited for short sequence alignment searches (Kent, 2002 Genome research 12, 656-664). Using this BLAT analysis, also referred as the “custom” analysis or simply BLAT (
Given the utility of RNA-seq in unraveling pseudogene transcription, the technical and analytical factors influencing the yield of pseudogene transcripts were assesed through transcriptome sequencing. A positive correlation was observed between the sequencing depth and total number of pseudogene transcripts (correlation co-efficient, +0.65) (
To explore regulatory elements associated with expressed pseudogenes, the pseudogene loci were subjected to two independent promoter analysis tools, namely Transfac and Genomatix-Promoter Inspector. These tools provide information only on previously annotated promoter elements associated with known genes and were unable to identify potential promoter elements associated with the pseudogene loci solely based on query sequences. Therefore, ChIP-seq analysis of a breast cancer cell line MCF7 probed with H3K4me3, a histone mark associated with transcriptionally active chromosomal loci, was performed and the results were interrogated with the MCF7 pseudogene transcript data. A statistically significant enrichment of H3K4me3 peaks at expressed pseudogene loci as compared to non-expressed pseudogenes (p-value=0.0145) was observed (FIG. S4), indicating that the pseudogene transcripts observed by RNA-seq are associated with transcriptionally active genomic loci. The pseudogene transcripts associated with H3K4me3 peaks encompass both unprocessed and processed pseudogenes, with no discernible differences in the pattern of expression. Next, the correlation between the expression of pseudogenes present within the introns of unrelated, expressed genes and that of their ‘host’ genes was assessed. No significant association was observed, indicating that pseudogenes are likely subject to independent regulatory mechanisms, even when residing within other transcriptionally active genes. Further, the observations with the breast specific unprocessed pseudogene ATP8A2 (likely arisen from duplication of wild type ATP8A2, thus likely harboring similar promoter elements) also indicate that there is no apparent correlation between the expression of the pseudogene and that of the wild type gene, which is expressed ubiquitously (
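The enrichment comparison described above (H3K4me3 peaks at expressed versus non-expressed pseudogene loci) can be sketched as a test on a 2x2 contingency table. The specification does not name the statistical test that produced the reported p-value; a one-sided Fisher's exact test is assumed here purely for illustration, and the function name `fisher_exact_greater` and the counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for a 2x2 table.

    a: expressed loci with a peak        b: expressed loci without a peak
    c: non-expressed loci with a peak    d: non-expressed loci without a peak

    Returns the probability, under the hypergeometric null, of seeing
    at least `a` expressed loci with a peak given the table margins.
    """
    row1 = a + b          # total expressed loci
    col1 = a + c          # total loci with a peak
    n = a + b + c + d     # all loci
    total = comb(n, col1)
    max_a = min(row1, col1)
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, max_a + 1))
    return tail / total


# Hypothetical counts (NOT data from the specification).
p = fisher_exact_greater(a=30, b=70, c=15, d=85)
print(f"one-sided p-value: {p:.4f}")
```

A small p-value under this assumed test would indicate that peaks overlap expressed pseudogene loci more often than expected by chance given the table margins.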
Next, the expression patterns of the pseudogene transcripts in the RNA-seq compendium comprising data from 248 cancer and 45 benign samples from 13 different tissue types (total 293 samples) were analyzed. Broad patterns of pseudogene expression, including 1056 pseudogenes that were detected in multiple samples (Table 3), were observed, which supports the hypothesis that transcribed pseudogenes contribute to the typical transcriptional repertoire of cells. In addition, distinct patterns of pseudogene expression, akin to that of protein-coding genes, including 154 highly tissue-specific and 848 moderately tissue-specific (or tissue-enriched) pseudogenes (
Of the 165 ubiquitous pseudogenes, a majority belonged to housekeeping genes, such as GAPDH, ribosomal proteins, several cytokeratins, and other genes widely expressed in most cell types. These genes are known to have numerous pseudogenes, and it is likely that several of these pseudogenes retain the capacity for widespread transcription, mimicking their protein-coding counterparts.
A second set of pseudogenes exhibited near-ubiquitous expression, but were frequently transcribed at lower levels in most tissues and robustly transcribed in one or two tissues. These pseudogenes were termed “non-specific”, and this group harbors more than 870 pseudogenes, comprising a large portion of our dataset (
A notable pseudogene not observed is PTENP1, a recently described pseudogene of PTEN implicated in the biology of the PI-3K signaling pathway (Poliseno et al. Nature 465, 1033-1038). No sequencing reads for PTENP1 were observed in the entire compendium, possibly due to the preponderance of cancer samples in the cohort, which tend to show low expression or deletion of this pseudogene (Poliseno et al. Nature 465, 1033-1038).
154 pseudogenes with highly specific expression patterns were observed, including pseudogenes derived from AURKA (kidney samples), RHOB (colon samples), and HMGB1 (myeloproliferative neoplasms (MPNs)) (
Because the sample compendium has a substantial number of cancer samples, pseudogenes with cancer-specific expression were next investigated. While a majority of the pseudogenes examined were found in both cancer and benign samples, 218 pseudogenes were expressed only in cancer samples, of which 178 were observed in multiple cancers and 40 were found to have highly-specific expression in a single cancer type only (
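The partition described above (pseudogenes seen only in cancer samples, split into those found in multiple cancers and those confined to a single cancer type) can be sketched with simple set logic. This is a simplified illustration: the criteria used in the specification for "highly-specific" expression may be stricter, and the function name `split_cancer_specific`, the data layout, and the pseudogene labels are all hypothetical.

```python
def split_cancer_specific(detections):
    """Partition cancer-only pseudogenes by how many cancer types express them.

    detections maps a pseudogene label to a list of (tissue_type, is_cancer)
    tuples, one per sample in which the pseudogene was detected.
    Returns (multi_cancer, single_cancer) as two sets of labels.
    """
    multi_cancer, single_cancer = set(), set()
    for pseudogene, samples in detections.items():
        # Skip pseudogenes never detected, or detected in any benign sample.
        if not samples or any(not is_cancer for _, is_cancer in samples):
            continue
        cancer_types = {tissue for tissue, _ in samples}
        if len(cancer_types) == 1:
            single_cancer.add(pseudogene)
        else:
            multi_cancer.add(pseudogene)
    return multi_cancer, single_cancer


# Hypothetical toy data (labels and samples are illustrative only).
detections = {
    "PG_A": [("prostate", True), ("prostate", True)],   # one cancer type
    "PG_B": [("breast", True), ("lung", True)],         # several cancer types
    "PG_C": [("breast", True), ("breast", False)],      # also in benign tissue
}

multi, single = split_cancer_specific(detections)
print("multiple cancers:", sorted(multi))
print("single cancer type:", sorted(single))
```

Under this scheme the two returned sets are disjoint and together make up the cancer-only group, mirroring how the 218 cancer-only pseudogenes divide into 178 and 40.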
Among the cancer-specific pseudogenes, a few noteworthy examples included pseudogenes derived from the eukaryotic translation initiation factors EIF4A1 and EIF4H, the heterogeneous nuclear ribonucleoprotein HNRPH2, and the small nuclear ribonucleoprotein SNRPG (
To investigate individual pseudogenes in greater detail, pseudogenes associated with breast and prostate cancer were investigated. Among the candidates in breast cancer, a novel unprocessed pseudogene cognate to ATP8A2, a LIM domain containing protein speculated to be associated with stress response and proliferative activity (Khoo et al., 1997 Protein expression and purification 9, 379-387), and DPP3, a metallopeptidase shown to have increased activity in endometrial and ovarian cancers (Simaga et al., 2003 Gynecologic oncology 91, 194-200) were investigated (
Approximately 25% of breast tumors demonstrate extremely high levels of this pseudogene, indicating that ATP8A2-Ψ may contribute to a particular subtype of breast cancer. ATP8A2-Ψ expression was analyzed with respect to luminal and basal breast subtypes, two prominent categories of breast cancer with distinct molecular and clinical characteristics. It was found that ATP8A2-Ψ expression was restricted to tumors with luminal histology, whereas basal tumors showed minimal expression of this pseudogene (
To investigate the role of ATP8A2-Ψ expression in breast cancer, siRNA based knockdown of both the wild type and pseudogene RNA was performed in two independent breast cancer cell lines that expressed both transcripts (
Analysis of tissue-specific pseudogenes restricted to prostate cancers identified numerous pseudogenes, including several derived from parental genes known to be altered or dysregulated in cancer (
Lastly, in the course of these analyses, a prostate cancer specific read-through transcript between KLK4, an androgen-induced gene, and KLKP1, an adjacent pseudogene, was identified. This read-through results in a chimeric RNA transcript combining the first two exons of KLK4 with the last two exons of KLKP1. This KLK4-KLKP1 transcript is predicted to retain an open reading frame incorporating 54 amino acids encoded by the KLKP1 pseudogene, indicating that it may encode a chimeric protein (
All publications, patents, patent applications and accession numbers mentioned in the above specification are herein incorporated by reference in their entirety. Although the invention has been described in connection with specific embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications and variations of the described compositions and methods of the invention will be apparent to those of ordinary skill in the art and are intended to be within the scope of the following claims.
This application claims priority to U.S. Provisional Application No. 61/577,767, filed on Dec. 20, 2011, which is herein incorporated by reference in its entirety.
This invention was made with government support under CA132874 and CA111275 awarded by the National Institutes of Health and W81XWH-08-1-0110 awarded by the Army Medical Research and Materiel Command. The government has certain rights in the invention.
Number | Date | Country
---|---|---
61577767 | Dec 2011 | US