The instant application contains a Sequence Listing which has been submitted in ASCII format via EFS-Web and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Feb. 28, 2013 is named SEQ—037822-056701-C.txt and is 202,138 bytes in size.
The present invention is directed to methods of screening for malignancies, cellular disorders, and other physiological states as well as novel high-throughput, low-cost, and flexible solution-based methods for RNA expression profiling, including expression of microRNAs and mRNAs.
The availability of high-performance RNA profiling technologies is central to the elucidation of the mechanisms of action of disease genes and the identification of small molecule therapeutics by molecular signature screening (Lamb et al., Cell 114:323-34 (2003); Stegmaier et al., Nature Genetics 36:257-63 (2004)). For example, detection and quantification of differentially expressed genes in a number of conditions, including malignancies and cellular disorders, would be useful in the diagnosis, prognosis and treatment of these pathological conditions. Quantification of gene expression would also be useful in indicating susceptibility to a range of conditions and in following up the effects of pharmaceuticals or toxins at the molecular level. These methods can also be used to screen for molecules that provide a desired gene profile.
The power of being able to simultaneously measure the expression level of multiple mRNA species has been of recent interest. For example, the expression of seventy and eighty-one transcripts have together been shown to outperform established clinical and histologic parameters in disease outcome prediction for breast cancer (van de Vijver et al., New Eng. J. Med. 347: 1999-2009 (2002)) and follicular lymphoma (Glas et al., Blood 105:301-7 (2005)), respectively.
It would also be highly desirable to simultaneously measure the expression level of microRNAs, a recently identified class of small non-coding RNA species. MicroRNAs are thought to act as post-transcriptional modulators of gene expression, and have been implicated as regulators of developmental timing, neuronal differentiation, cell proliferation, programmed cell death, and fat metabolism. Determining expression profiles of microRNAs is particularly challenging, however, because of their short size, typically around 21 nucleotides, and their high degree of sequence homology, where different microRNAs may differ by only a single nucleotide.
The rapid pace of discovery of new genes generated by large-scale genomic and proteomic initiatives has required the development of high-throughput strategies to quantify the expression of a large number of genes and their alternatively spliced isoforms, as well as elucidate their biological functions, regulations and interactions. (Consortium, E. P. (2004) Science 306, 636-40; Lander et al., Nature 409, 860-921 (2001)) A number of high-throughput techniques have been developed to detect and quantify nucleic acids. Microarray-based analysis has been one widely used high-throughput technique used to study nucleic acids. Another approach for high-throughput analysis of nucleic acids involves the sequencing of a short tag of each transcript, including expressed sequence tag (EST) sequencing (Lander et al., 2001) and serial analysis of gene expression (SAGE) (Velculescu et al., Science 270, 484-7 (1995)).
However, both microarray and tag-sequencing techniques are associated with a number of significant problems. These techniques typically are not sufficiently sensitive and demand relatively high input levels of mRNA that are often unavailable, particularly when studying human diseases. In addition, array quality is often a problem for cDNA or oligonucleotide microarrays. For example, most researchers cannot confirm the identity of what is immobilized on the surface of a microarray and generally have limited capacity to check and control possible errors in the microarray fabrication. Additionally, the high costs of microarrays have caused many investigators to perform relatively few control experiments to assess the reliability, validity, and repeatability of their findings. Moreover, given the high costs of microarray fabrication, custom designing arrays to tailor analysis to an individual expression profile is simply impractical in many instances. Tag-sequencing analysis requires a large amount of sequencing effort, which is generally slow and costly; the sensitivity of tag-based analyses is relatively low, and high sensitivity can only be achieved by sequencing a large number of tags.
Thus it would be desirable to develop simple, flexible, low-cost, high-throughput methods for the sensitive and accurate quantification of nucleic acids, which can be easily automated and scaled up to accommodate testing of large numbers of samples and overcome the problems associated with available techniques. Such a method would serve diagnostic, prognostic and therapeutic purposes, and would facilitate genomic, pharmacogenomic and proteomic applications, including the discovery of small molecule therapeutics.
We have now discovered simple, flexible, low-cost and high-throughput solution-based methods for expression profiling nucleic acids. More specifically, the invention provides methods for detection of multiple genes in a single reaction, including for the detection of mRNAs and microRNAs.
The present invention provides a solution-based method for determining the expression level of a population of target nucleic acids, by a) providing in solution a population of target-specific bead sets, where each target-specific bead set is individually detectable and comprises a capture probe which corresponds to an individual target nucleic acid, referred to as an individual bead set; b) hybridizing in solution the population of target-specific bead sets with a population of molecules that can contain a population of detectable target molecules, where each target nucleic acid has been transformed into a corresponding detectable target molecule which will specifically bind to its corresponding individual target-specific bead set; and c) screening in solution for detectable target molecules hybridized to target-specific beads to determine the expression level of the population of target nucleic acids.
In one embodiment, the target-specific bead sets can have at least 5 individual bead sets that can bind with a corresponding set of target nucleic acids. The population of target-specific beads can contain at least 100 individual bead sets that bind with a corresponding set of target nucleic acids.
One preferred embodiment provides a method for detection of populations of mRNAs. In this method, mRNA is transformed into a corresponding detectable target molecule by a) reverse transcribing the mRNA to generate a cDNA; b) hybridizing an upstream probe and a downstream probe to the cDNA, where the upstream probe has a universal upstream sequence and an upstream target-specific sequence, and the downstream probe has a universal downstream sequence and a downstream target-specific sequence, such that when the upstream probe and the downstream probe are both hybridized to the cDNA the two probes are capable of being ligated; c) ligating the two probes to generate ligation complexes; and d) amplifying the ligation complexes with a universal upstream primer and a universal downstream primer, which are complementary to the universal upstream sequence and the universal downstream sequence, respectively. In this method, at least one of the universal primers is detectably labeled, such that the product of the amplification is detectably labeled, thereby generating a detectable target molecule which corresponds to the target nucleic acid. In this method, either the upstream probe or the downstream probe also has an amplicon tag between the universal sequence and the target-specific sequence. The amplicon tag has a nucleic acid sequence that is unique for the mRNA to be detected, and that is complementary to the sequence of the capture probe of the corresponding bead set, allowing the detectable nucleic acid molecule to hybridize to the bead set with the complementary capture probe.
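The probe layout described in this embodiment can be illustrated with a minimal Python sketch. All sequences, the names `UNIVERSAL_UP` and `UNIVERSAL_DOWN`, and the helper functions are hypothetical placeholders and are not taken from the Sequence Listing. The sketch places the amplicon tag on the upstream probe (the text allows either probe) and assumes the capture probe is the reverse complement of the tag; which amplified strand actually hybridizes to the bead depends on which universal primer carries the label.

```python
# Hypothetical placeholder sequences; none come from the Sequence Listing.
UNIVERSAL_UP = "GGAAGGTTCCAAGGTTCCAA"
UNIVERSAL_DOWN = "TTGGCCAATTGGCCAATTGG"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_probe_pair(tag, target_up, target_down):
    """Assemble an upstream/downstream probe pair for one target.

    The amplicon tag sits between the universal upstream sequence and the
    upstream target-specific sequence (an assumption of this sketch; the tag
    could instead be carried by the downstream probe).
    """
    upstream = UNIVERSAL_UP + tag + target_up
    downstream = target_down + UNIVERSAL_DOWN
    return upstream, downstream


def ligation_product(upstream, downstream):
    """Model ligation as concatenation of the two adjacently hybridized probes."""
    return upstream + downstream


tag = "ACCATTGACTGACCATTGAC"  # hypothetical tag, unique to one mRNA
upstream, downstream = build_probe_pair(tag, "GATTACAGATTACA", "CCGGAATTCCGG")
amplicon_template = ligation_product(upstream, downstream)
capture_probe = reverse_complement(tag)  # would be coupled to the matching bead set
```

Because only the tag and the target-specific sequences change from target to target, the same pair of universal primers can amplify every ligation complex in the multiplex reaction.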
One embodiment of the invention provides the use of these multiplex mRNA detection methods to screen for the presence of a particular physiological state in a test sample, such as a malignancy, infection or a cellular disorder. In one embodiment, the genes which are specifically associated with one physiological state but not another physiological state are already determined; such a group of genes is typically referred to as an expression signature. To screen for a physiological state using the mRNA detection methods, one first determines the expression signature of a group of genes in the test sample; and then compares the expression signature between the test sample and a corresponding control sample, where a difference in the expression signature between the test sample and the control sample is indicative of the test sample comprising said malignant cells, infected cells or cellular disorder. In one embodiment, the expression signature has at least 5 genes.
One embodiment of the invention provides a method for identifying an expression signature for a physiological state, using the multiplex mRNA detection methods to rapidly screen for genes which are differentially expressed between two physiological states. In one embodiment, the expression signature has at least 5 genes. Examples of physiological states include the presence of a cancer, infection, or a cellular disorder. To identify novel expression signatures, one isolates cells from two groups of individuals, one with and one without the physiological state of interest, and then identifies those genes which are differentially expressed in the two groups of individuals. For those genes which differ at a statistically significant level, linear regression analysis can be applied to identify an expression signature of a gene group that is indicative of an individual having the physiological state of interest.
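The first step of this signature search, picking out genes that differ between the two groups, might be sketched as follows. This is an illustration under stated assumptions and not the analysis used by the inventors: Welch's t statistic is swapped in as the per-gene test, the cutoff `t_cutoff` is a hypothetical stand-in for a proper significance threshold, and the subsequent regression step is not shown. Gene names and values are invented.

```python
from math import copysign, sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups (each needs >= 2 values)."""
    diff = mean(group_a) - mean(group_b)
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    if se == 0:
        # No spread in either group: identical means give 0, otherwise +/- infinity.
        return 0.0 if diff == 0 else copysign(float("inf"), diff)
    return diff / se


def candidate_signature(with_state, without_state, t_cutoff):
    """Return genes whose |t| meets the cutoff, sorted by descending |t|."""
    scores = {gene: welch_t(with_state[gene], without_state[gene])
              for gene in with_state}
    hits = [gene for gene, t in scores.items() if abs(t) >= t_cutoff]
    return sorted(hits, key=lambda gene: -abs(scores[gene]))


# Invented expression levels: one list of per-individual values per gene.
with_state = {"gene_1": [10.0, 11.0, 12.0], "gene_2": [5.0, 6.0, 7.0]}
without_state = {"gene_1": [1.0, 2.0, 3.0], "gene_2": [5.0, 7.0, 6.0]}
signature = candidate_signature(with_state, without_state, t_cutoff=4.0)
```

In practice the cutoff would be replaced by a significance test with correction for the number of genes screened.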
One preferred embodiment provides a method for detection of populations of microRNAs. In this method, microRNAs are transformed into corresponding detectable target molecules by first ligating at least one adaptor to each microRNA, generating an adaptor-microRNA molecule; and then detectably labeling the adaptor-microRNA molecule, thereby generating a detectable target molecule which corresponds to the target nucleic acid. In one embodiment, the adaptor-microRNA is detectably labeled by reverse transcription using the adaptor-microRNA as a template for polymerase chain reaction, wherein a pair of primers is used in said polymerase chain reaction, and wherein at least one of said primers is detectably labeled. In this method, the capture probe of the bead set which corresponds to an individual microRNA has a sequence which is complementary to the miRNA sequence, allowing the detectable target molecule to bind to the corresponding bead set.
The invention also provides the use of the multiplex microRNA detection methods to screen for the presence of a malignancy in a test sample. In one embodiment, one analyzes the level of expression of microRNAs in a test sample and a corresponding control sample, where a lower level of expression of microRNAs in the test sample relative to the control sample is indicative of the test sample containing malignant cells.
One embodiment of the invention provides a method of screening an individual at risk for cancer by obtaining at least two cell samples from the individual at different times; and determining the level of expression of microRNAs in the cell samples, where a lower level of expression of microRNAs in the later obtained cell sample compared to the earlier obtained cell sample is indicative of the individual being at risk for cancer.
Another embodiment of the invention provides methods of screening an individual at risk for cancer, by determining the level of expression for a specific group of microRNAs, sometimes referred to as a profile group of microRNAs, where lower expression of the profile group of microRNAs is associated with risk for a particular type of cancer.
One embodiment of the invention provides a method for identifying an active compound. In this embodiment, cells are contacted with a plurality of molecules including chemical compounds and biologic molecules, and the expression of a set of marker genes present in the cells is determined using the novel detection methods of the invention. To identify active compounds, the expression of the marker genes is scored to identify a cellular phenotype, the presence of a specific cellular phenotype being indicative of an active compound. In one embodiment the plurality of chemical compounds is a set of compounds selected from the group consisting of small molecule libraries, FDA approved drugs, synthetic chemical libraries, phage display libraries, and dosage libraries. In another embodiment the active compound is an anti-cancer drug. In a further embodiment the active compound is a cellular differentiation factor. In certain embodiments, the set of marker genes can include genes encoding mRNAs and/or genes encoding microRNAs.
Another embodiment of the invention provides kits for determining in solution the expression level of a population of target nucleic acids. Kits can include a population of detectable bead sets, wherein each target-specific bead set is individually detectable and is capable of being coupled to a capture probe which corresponds to an individual target nucleic acid of interest; components for transforming a target nucleic acid of interest into a corresponding detectable target molecule which will specifically bind to its corresponding individual target-specific bead set; and instructions for performing the solution-based detection methods of the invention.
The invention is directed to the discovery and use of improved methods for expression profiling of nucleic acids. As will be discussed in detail below, we have found a simple and flexible method that permits us to rapidly and inexpensively measure gene expression of multiple genes in a single multiplex reaction, ranging from a few genes to 50, 60, 70, 90 or 200 or more genes. Using this method, we have analyzed microRNA and mRNA expression levels, and found these methods are highly efficient and as effective as commercial slide-based microarrays. However, unlike microarrays, the flexibility of the present method permits simple tailoring of the population of genes which can be analyzed in a single reaction. Thus, the present invention is particularly useful for gene expression profiling methods. In addition, using the methods of the invention, we have discovered that microRNAs are downregulated in a wide variety of cancers. Thus, the invention also provides methods for detection of cancer, using microRNA expression profiling.
In one embodiment, the method uses a population of bead sets and measures in solution the expression level of a population of target nucleic acids of interest in a sample. For each individual target nucleic acid of interest, there is a corresponding bead set which comprises a capture probe specific for its target nucleic acid and a unique detectable label, referred to as the bead signal. In this method, a target nucleic acid, such as mRNA in a cell, is first labeled with a detectable signal, referred to as the target signal, before being hybridized with the population of bead sets. Following hybridization in solution of the labeled target nucleic acids with the population of bead sets, the level of both detectable signals is determined for each hybridized bead-target complex. Thus, the bead signal indicates which target nucleic acid is present in the complex, and the level of the target signal indicates the level of expression of that target nucleic acid in the sample. The method can be used to detect tens, or hundreds, or thousands of different target nucleic acids in a single sample.
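The two-signal readout described here can be sketched in Python. The bead identifiers, target names, and intensity values are hypothetical, and summarizing each bead set by the median target signal is an assumption of this sketch rather than something the text specifies.

```python
from collections import defaultdict
from statistics import median


def summarize_bead_events(events, bead_to_target):
    """Collapse per-bead readings into one expression value per target.

    events: iterable of (bead_signal, target_signal) pairs, one per detected bead.
    bead_to_target: maps each bead signal to the target its capture probe binds.
    """
    by_target = defaultdict(list)
    for bead_signal, target_signal in events:
        target = bead_to_target.get(bead_signal)
        if target is None:
            continue  # bead signal not part of this panel; ignore the event
        by_target[target].append(target_signal)
    # Median per bead set (an assumed summary statistic).
    return {target: median(values) for target, values in by_target.items()}


bead_to_target = {"bead_01": "target_A", "bead_02": "target_B"}
events = [
    ("bead_01", 120.0),
    ("bead_02", 40.0),
    ("bead_01", 100.0),
    ("bead_01", 140.0),
    ("bead_02", 60.0),
    ("bead_99", 500.0),  # unknown bead set, skipped
]
levels = summarize_bead_events(events, bead_to_target)
```

The bead signal acts purely as an address, so the same routine works whether the panel holds five bead sets or several hundred.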
Accordingly, the invention provides simple, flexible, low-cost, high-throughput methods for simultaneously measuring the expression level of multiple nucleic acids, including mRNAs and microRNAs. In terms of multiplicity, the methods allow the expression level of a few to hundreds, and even thousands, of different target nucleic acids to be measured simultaneously in a single reaction (e.g. 5, 10, 50, 100, 500, or even 1,000 different target nucleic acids). In terms of throughput, the methods allow high numbers of the multiplexed samples to be processed simultaneously, allowing thousands of samples to be rapidly processed. The simplicity of the methods allows the entire procedure to be readily automated. The low cost aspect of the method is reflected for example in a typical unit cost of only several dollars to analyze the expression of 100 nucleic acids in a single sample. As exemplified herein, the performance of the present methods is at least comparable to the current industry-standard oligonucleotide microarrays.
One particularly important advantage of the present method is the high degree of flexibility it provides regarding the population of target nucleic acids to be analyzed. Because the population of bead sets is not fixed, as opposed to the probes on a microarray, the bead population can be readily changed by adding or removing one of the individual bead sets, without altering the other bead sets in the total population. Thus, unlike a slide-based microarray, the population of target nucleic acids to be analyzed can be readily tailored to specific needs, without refabrication of the entire population of bead sets.
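As a loose illustration of this flexibility, a bead panel can be modeled as a mapping from bead signal to target, where adding or removing one entry leaves every other entry untouched. The function names, bead identifiers, and target names below are hypothetical.

```python
def add_bead_set(panel, bead_signal, target):
    """Return a new panel with one extra bead set; existing sets are unchanged."""
    if bead_signal in panel:
        raise ValueError(f"bead signal {bead_signal!r} is already in use")
    updated = dict(panel)
    updated[bead_signal] = target
    return updated


def remove_bead_set(panel, bead_signal):
    """Return a new panel without the given bead set."""
    return {signal: target for signal, target in panel.items()
            if signal != bead_signal}


panel = {"bead_01": "target_A", "bead_02": "target_B"}
panel_v2 = add_bead_set(panel, "bead_03", "target_C")  # extend the panel
panel_v3 = remove_bead_set(panel_v2, "bead_01")        # drop one target
```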
The detection methods of the invention can be used in a wide variety of applications as described in detail below, including but not limited to gene expression profiling, screening assays, diagnostic and prognostic assays, for example for gene expression signatures, small molecule or genetic library screening, such as screening cDNA/ORFs, shRNAs, and microRNAs, pharmacogenomics, and the classification of induced biological states.
The invention provides a solution-based method for determining the expression level of a population of target nucleic acids. The method comprises the steps of (a) providing in solution a population of target-specific bead sets, wherein each target-specific bead set is individually detectable and comprises a capture probe which corresponds to an individual target nucleic acid referred to as an individual bead set; (b) hybridizing in solution the population of target-specific bead sets with a population of molecules that can contain a population of detectable target molecules, wherein each target nucleic acid has been transformed into a corresponding detectable target molecule which will specifically bind to its corresponding individual target-specific bead set; and (c) screening in solution for detectable target molecules hybridized to target-specific beads to determine the expression level of the population of target nucleic acids.
In one embodiment, the population of target-specific bead sets comprises at least 5 individual bead sets that can bind with a corresponding set of target nucleic acids. In one embodiment, the population of target-specific beads comprises at least 100 individual bead sets that can bind with a corresponding set of target nucleic acids.
In one embodiment, the population of target nucleic acids is a population of mRNAs. In one embodiment, the population of target nucleic acids is a population of microRNAs.
In one embodiment, each target nucleic acid is an mRNA which has been transformed into a corresponding detectable target molecule. The mRNA is transformed into a corresponding detectable target molecule by a process comprising the steps of (a) reverse transcribing the mRNA target nucleic acid to generate a cDNA; (b) contacting the cDNA with an upstream probe and a downstream probe, wherein the upstream probe comprises a universal upstream sequence and an upstream target-specific sequence, and the downstream probe comprises a universal downstream sequence and a downstream target-specific sequence, such that when the upstream probe and the downstream probe are both hybridized to the cDNA the two probes are capable of being ligated; (c) ligating said cDNA contacted with said upstream and downstream probes to generate ligation complexes; and (d) amplifying said ligation complexes with a pair of universal primers comprising a universal upstream primer and a universal downstream primer. The universal upstream primer is complementary to the universal upstream sequence and the universal downstream primer is complementary to the universal downstream sequence. At least one of the pair of universal primers is detectably labeled. The product of the amplification is detectably labeled. Accordingly, a detectable target molecule is generated which corresponds to the target nucleic acid.
In one embodiment, in the process of transforming the mRNA into a corresponding detectable target molecule, either the upstream probe further comprises an amplicon tag between the universal sequence and the target-specific sequence or the downstream probe further comprises an amplicon tag between the universal sequence and the target-specific sequence. The amplicon tag comprises a nucleic acid sequence that is complementary to the sequence of the capture probe of the bead set.
In one embodiment, each target nucleic acid is a microRNA which has been transformed into a corresponding detectable target molecule. The process of transforming the microRNA into a corresponding detectable target molecule comprises the steps of (a) ligating at least one adaptor to the microRNA, generating an adaptor-microRNA molecule; and (b) detectably labeling said adaptor-microRNA molecule. Accordingly, a detectable target molecule is generated which corresponds to the target nucleic acid.
In one embodiment, the adaptor-microRNA is detectably labeled by reverse transcription using the adaptor-microRNA as a template for polymerase chain reaction. In one embodiment, a pair of primers is used in said polymerase chain reaction, and at least one of said primers is detectably labeled.
The present invention further provides a method of screening for the presence of malignancy, infection, cellular disorder, or response to a treatment in a test sample. The method comprises the steps of (a) determining the expression signature of a group of genes in the test sample; and (b) comparing the expression signature between the test sample and a reference sample. A similarity or difference in the expression signature between the test sample and the reference sample is indicative of the presence of malignant cells, infected cells, cellular disorder, or response to a treatment in the test sample. In one embodiment, the solution-based method for determining the expression level of target nucleic acids is used for determination of the expression signature in the test sample and the target nucleic acids are mRNAs. In one embodiment, the expression signature comprises at least 5 genes.
In one embodiment, the reference sample is known to express a predetermined expression signature indicative of the presence of malignancy, infection, or cellular disorder, and the similarity of the expression signature of the test sample to the predetermined expression signature of the reference sample indicates the presence of malignant cells, infected cells, or cellular disorder, in the test sample.
In one embodiment, the reference sample is known to express a predetermined expression signature indicative of a response to treatment, and the similarity of the expression signature of the test sample to the predetermined expression signature of the reference sample indicates the response to a treatment in the test sample. In one embodiment, the response to treatment is an adverse response to treatment. In one embodiment, the response to treatment is a therapeutic response to treatment.
The invention further provides a method of identifying an expression signature associated with the presence or risk of cancer, infection, cellular disorder, or response to treatment. The method comprises the steps of (a) isolating cells from a group of individuals with said cancer, infection, cellular disorder, or response to treatment, and determining the expression levels of a group of genes; (b) isolating cells from a group of individuals without said cancer, infection, cellular disorder, or response to treatment, and determining the expression levels of said group of genes; and (c) identifying differentially expressed genes from said group of genes which are together indicative of the presence or risk of cancer, infection, cellular disorder, or response to treatment in an individual. Accordingly, an expression signature is identified associated with the presence or risk of cancer, infection, cellular disorder, or response to treatment. In one embodiment, the expression levels of the group of genes are determined using the solution-based method of determining expression level of target nucleic acids.
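One way to quantify the similarity between a test signature and a reference signature is a correlation coefficient, sketched below. The text does not name a similarity measure, so Pearson correlation, the threshold `min_r`, and all gene names and values are assumptions made for illustration. The sketch requires that neither profile is constant across the signature genes.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def matches_reference(test, reference, genes, min_r):
    """True if the test signature correlates with the reference at >= min_r."""
    x = [test[gene] for gene in genes]
    y = [reference[gene] for gene in genes]
    return pearson(x, y) >= min_r


genes = ["g1", "g2", "g3", "g4", "g5"]  # a signature of five genes
reference = {"g1": 2.0, "g2": 4.0, "g3": 6.0, "g4": 8.0, "g5": 10.0}
similar_test = {"g1": 1.1, "g2": 2.0, "g3": 3.2, "g4": 3.9, "g5": 5.0}
dissimilar_test = {"g1": 5.0, "g2": 4.0, "g3": 3.0, "g4": 2.0, "g5": 1.0}

is_similar = matches_reference(similar_test, reference, genes, min_r=0.8)
is_dissimilar = matches_reference(dissimilar_test, reference, genes, min_r=0.8)
```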
The invention further provides a method of screening for the presence of malignant cells in a test sample. The method comprises the steps of (a) determining the level of expression of a group of microRNAs in the test sample, and (b) comparing the level of expression of a group of microRNAs between the test sample and a reference sample. In one embodiment, a lower level of expression of the group of microRNAs in the test sample compared to the reference sample is indicative of the test sample containing malignant cells. In one embodiment, a similarity or difference in the level of expression of the group of microRNAs in the test sample compared to the reference sample is indicative of the test sample containing malignant cells. In one embodiment, the microRNAs are transformed into a corresponding detectable target molecule by the process of the present invention. In one embodiment, the determination of the level of microRNA in the sample is determined by the solution-based method of the present invention for determining the expression level of a population of target nucleic acids. In one embodiment, the group of microRNAs comprises at least 5 microRNAs. In one embodiment, the test sample is isolated from an individual at risk of or suspected of having cancer.
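The comparison in step (b) can be sketched as a median log2 ratio across the group of microRNAs. The microRNA names, the expression values, and the cutoff of -0.5 are hypothetical, and the median log ratio is an assumed summary, not a statistic prescribed by the text. All expression values must be positive.

```python
from math import log2
from statistics import median


def median_log2_ratio(test, reference):
    """Median log2(test / reference) over microRNAs measured in both samples."""
    shared = sorted(set(test) & set(reference))
    return median(log2(test[name] / reference[name]) for name in shared)


def globally_lower(test, reference, cutoff=-0.5):
    """True if the group of microRNAs is lower in the test sample overall.

    cutoff is a hypothetical threshold on the median log2 ratio.
    """
    return median_log2_ratio(test, reference) <= cutoff


test_sample = {"mir_a": 50.0, "mir_b": 25.0, "mir_c": 100.0}
reference_sample = {"mir_a": 100.0, "mir_b": 100.0, "mir_c": 100.0}

ratio = median_log2_ratio(test_sample, reference_sample)
flagged = globally_lower(test_sample, reference_sample)
```

The same comparison applies to two samples taken from one individual at different times, with the earlier sample playing the role of the reference.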
The invention further provides a method of screening an individual at risk for cancer. The method comprises the steps of (a) obtaining at least two cell samples from the individual at different times; (b) determining the level of expression of a group of microRNAs in the cell samples, and (c) comparing the level of expression of a group of microRNAs between the cell samples obtained at different times. A lower level of expression of the group of microRNAs in the later obtained cell sample compared to the earlier obtained cell sample is indicative of the individual being at risk for cancer. In one embodiment, the microRNAs are transformed into a corresponding detectable target molecule by the process of the present invention. In one embodiment, the determination of the level of microRNA in the sample is determined by the solution-based method of the present invention for determining the expression level of a population of target nucleic acids.
The invention further provides a method of identifying a microRNA expression signature associated with the presence or risk of cancer, infection, cellular disorder, or response to treatment. The method comprises the steps of (a) isolating cells from a group of individuals with said cancer, infection, cellular disorder, or response to treatment, and determining the expression levels of a group of microRNAs; (b) isolating cells from a group of individuals without said cancer, infection, cellular disorder, or response to treatment, and determining the expression levels of said group of microRNAs; and (c) identifying differentially expressed microRNAs from said group of microRNAs which are together indicative of the presence or risk of cancer, infection, cellular disorder, or response to treatment in an individual. Accordingly, a microRNA expression signature is identified associated with the presence or risk of cancer, infection, cellular disorder, or response to treatment. In one embodiment, the microRNAs are transformed into a corresponding detectable target molecule by the process of the present invention. In one embodiment, the determination of the level of microRNA in the sample is determined by the solution-based method of the present invention for determining the expression level of a population of target nucleic acids.
The invention further provides a method of classifying a tumor sample. The method comprises (a) determining the expression pattern of a group of microRNAs in a tumor sample of unknown tissue origin, generating a tumor sample profile; (b) providing a model of tumor origin microRNA expression patterns based on a dataset of the expression of microRNAs of tumors of known origin; and (c) comparing the tumor sample profile to the model to determine which tumors of known origin the sample most closely resembles. Accordingly, the tissue origin of the tumor sample is classified. In one embodiment, the determination of the level of microRNA in the sample is determined by the solution-based method of the present invention for determining the expression level of a population of target nucleic acids.
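The classification step can be illustrated with a nearest-centroid sketch. The text does not specify the form of the model, so representing each known origin by a mean microRNA profile and comparing by Euclidean distance is an assumption, as are the tissue labels and values.

```python
from math import dist


def nearest_origin(sample_profile, centroids):
    """Return the known origin whose mean profile is closest to the sample.

    sample_profile: list of expression values, one per microRNA.
    centroids: maps each known origin to its mean profile in the same order.
    """
    return min(centroids, key=lambda origin: dist(sample_profile, centroids[origin]))


# Hypothetical mean profiles built from tumors of known origin.
centroids = {
    "tissue_A": [1.0, 5.0, 2.0],
    "tissue_B": [6.0, 1.0, 4.0],
}
unknown_sample = [5.5, 1.5, 3.5]
predicted_origin = nearest_origin(unknown_sample, centroids)
```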
The invention further provides a method of classifying a sample from an unknown mammalian species. The method comprises the steps of (a) determining the expression pattern of a group of microRNAs in a sample of an unknown mammalian species, generating a sample profile; (b) providing a model of known mammalian species microRNA expression patterns based on a dataset of the expression of microRNAs of known mammalian species; and (c) comparing the sample profile to the model of known species to determine which known mammalian species the sample profile most closely resembles. Accordingly, the mammalian species of the sample is classified. In one embodiment, the determination of the level of microRNA in the sample is determined by the solution-based method of the present invention for determining the expression level of a population of target nucleic acids.
The invention further provides a method for identifying an active compound or molecule. The method comprises the steps of (a) contacting cells with a plurality of compounds or molecules, (b) determining the expression of a set of marker genes present in the cells using the solution-based method of the present invention for determining the expression level of a population of target nucleic acids, and (c) scoring the expression of the marker genes to identify a cellular phenotype. The presence of a specific cellular phenotype is indicative of an active compound or molecule. In one embodiment, the plurality of chemical compounds or molecules is a set of compounds or molecules selected from the group consisting of small molecule libraries, FDA approved drugs, synthetic chemical libraries, phage display libraries, and dosage libraries. In one embodiment, the set of marker genes comprises genes which encode microRNAs and/or messenger RNAs. In one embodiment, the active compound is an anti-cancer drug. In one embodiment, the cellular phenotype is a tumorigenic status of the cell. In one embodiment, the cellular phenotype is a metastatic status of the cell. In one embodiment, the set of marker genes is a cancer versus non-cancer marker gene set. In one embodiment, the set of marker genes is a metastatic versus non-metastatic marker gene set. In one embodiment, the set of marker genes is a radiation resistant versus radiation sensitive marker gene set. In one embodiment, the set of marker genes is a chemotherapy resistant versus chemotherapy sensitive marker gene set. In one embodiment, the active compound is a cellular differentiation factor. In one embodiment, the cellular phenotype is a cellular differentiation status.
The invention further provides a kit for determining in solution the expression level of a population of target nucleic acids. The kit comprises: (a) a population of detectable bead sets, wherein each target-specific bead set is individually detectable and is capable of being coupled to a capture probe which corresponds to an individual target nucleic acid of interest; (b) components for transforming a target nucleic acid of interest into a corresponding detectable target molecule which will specifically bind to its corresponding individual target-specific bead set; and (c) instructions for performing the solution-based method of the present invention for determining the expression level of a population of target nucleic acids. In one embodiment, the population of target nucleic acids comprises mRNAs and the kit further comprises components for performing the method of the present invention for transforming mRNA into a corresponding detectable target molecule. In one embodiment, the population of target nucleic acids comprises microRNAs, and the kit further comprises components for performing the method of the present invention for transforming microRNA into a corresponding detectable target molecule. In one embodiment, the kit further comprises a polymerase and nucleotide bases. In one embodiment, the kit further comprises a plurality of detectable labels. In one embodiment, the kit further comprises capture probes capable of specifically hybridizing to at least 10 different microRNAs, at least 30 different microRNAs, at least 100 different microRNAs, or at least 200 different target microRNAs. In one embodiment, the kit further comprises oligonucleotides for use as capture probes or oligonucleotide sequence information to design target specific probes capable of specifically hybridizing to at least 10 different target mRNAs, at least 30 different target mRNAs, at least 100 different target mRNAs, or at least 200 different target mRNAs.
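Step (c), scoring marker gene expression to call a phenotype, might look like the following sketch. The scoring rule (mean of markers expected to rise minus mean of markers expected to fall), the threshold, and all compound and gene names are hypothetical and serve only to illustrate the idea.

```python
from statistics import mean


def phenotype_score(expression, up_markers, down_markers):
    """Mean level of 'up' markers minus mean level of 'down' markers."""
    up = mean(expression[gene] for gene in up_markers)
    down = mean(expression[gene] for gene in down_markers)
    return up - down


def active_compounds(wells, up_markers, down_markers, threshold):
    """Return compounds whose treated cells score at or above the threshold."""
    return [compound for compound, expression in wells.items()
            if phenotype_score(expression, up_markers, down_markers) >= threshold]


# Invented marker gene levels measured after treating cells with each compound.
wells = {
    "cmpd_1": {"g1": 8.0, "g2": 6.0, "g3": 2.0},
    "cmpd_2": {"g1": 3.0, "g2": 3.0, "g3": 5.0},
}
hits = active_compounds(wells, up_markers=["g1", "g2"], down_markers=["g3"],
                        threshold=2.0)
```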
In one embodiment, the population of target nucleic acids comprises a set of marker genes associated with the presence or risk of cancer, infection, cellular disorder, or response to treatment. In one embodiment, the sample comprises or is suspected of comprising malignant cells.
The target nucleic acid can be only a minor fraction of a complex mixture such as a biological sample. As used herein, the term “biological sample” refers to any biological material obtained from any source (e.g. human, animal, plant, bacteria, fungi, protist, virus). For use in the invention, the biological sample should contain a nucleic acid molecule. Examples of appropriate biological samples for use in the instant invention include: solid materials (e.g. tissue, cell pellets, biopsies) and biological fluids (e.g. urine, blood, saliva, amniotic fluid, mouth wash).
Nucleic acid molecules can be isolated from a particular biological sample using any of a number of procedures, which are well-known in the art, the particular isolation procedure chosen being appropriate for the particular biological sample.
The invention provides a solution-based method for highly multiplexed determination of the expression levels of a population of target nucleic acids. The population of target nucleic acids can be a collection of individual target nucleic acids of interest, such as a member of a gene expression signature or just a particular gene of interest. Each individual target nucleic acid of interest is first transformed into a detectable target molecule in a quantitative or semi-quantitative manner, such that the level of each target nucleic acid is reflected by the level of the corresponding detectable target molecule, which is labeled with a detectable signal such as a fluorescent marker. The detectable signal of the target molecule is sometimes referred to as the target molecule signal or simply as the target signal. The method also involves a population of target-specific bead sets, where each target-specific bead set is individually detectable and has a capture probe which corresponds to an individual target nucleic acid. The population of bead sets is hybridized in solution with the population of detectable target molecules to form a hybridized bead-target complex. To determine the expression level of the population of target nucleic acids present, one detects both the target signal and the bead signal for each hybridized bead-target complex, such that the level of the target signal indicates the level of expression of the target nucleic acid, and the bead signal indicates the identity of the target nucleic acid being detected. In one embodiment, the beads can be LUMINEX™ beads, which are polystyrene microspheres that are internally labeled with two spectrally distinct fluorochromes, such that each set of LUMINEX™ beads can be distinguished by its spectral address.
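The readout just described, in which the bead signal gives the identity of the target and the target signal gives its level, can be sketched as follows. This is a minimal illustration under stated assumptions: each detected bead is represented as a (bead set identifier, target signal) pair, the per-target level is summarized as the median over beads of that set (the summary statistic is an assumption, not specified above), and all identifiers and values are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical mapping from bead-set identity (the bead signal) to target name.
BEAD_TO_TARGET = {1: "miR-A", 2: "miR-B", 3: "mRNA-C"}

def summarize_events(events, bead_to_target):
    """Group per-bead readings by target and report the median target signal."""
    by_target = defaultdict(list)
    for bead_set_id, target_signal in events:
        target = bead_to_target.get(bead_set_id)
        if target is not None:  # ignore beads that match no known bead set
            by_target[target].append(target_signal)
    return {target: median(signals) for target, signals in by_target.items()}

# Hypothetical per-bead readings: (bead set identifier, target signal).
events = [(1, 520.0), (2, 40.0), (1, 480.0), (3, 1500.0),
          (2, 60.0), (1, 500.0), (9, 10.0)]
levels = summarize_events(events, BEAD_TO_TARGET)
print(levels)
```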
The methods of the invention can be used to detect any population of target nucleic acids of interest, including but not limited to DNAs and RNAs. In one preferred embodiment the target nucleic acids are messenger RNAs (mRNAs). In another preferred embodiment the target nucleic acids are microRNAs.
The present invention provides multiplex detection of target nucleic acids in a sample. As used herein, the phrase multiplex or grammatical equivalents refers to the detection of more than one target nucleic acid of interest within a single reaction. In one embodiment of the invention, multiplex refers to the detection of between 2-10,000 different target nucleic acids in a single reaction. As used herein, multiplex refers to the detection of any range between 2-10,000, e.g., between 5-500 different target nucleic acids in a single reaction, 25-1000 different target nucleic acids, 10-100 different target nucleic acids in a single reaction etc.
The present invention also provides high throughput detection and analysis of target nucleic acids in a sample. As used herein, the phrase “high throughput” refers to the detection or analysis of more than one reaction in a single process, where each reaction is itself a multiplex reaction, detecting more than one target nucleic acid of interest. In one preferred embodiment, 2-10,000 multiplex reactions can be processed simultaneously.
The solution-based methods of the invention use detectable target-specific bead sets which comprise a capture probe coupled to a detectable bead, where the capture probe corresponds to an individual target nucleic acid. As used herein, beads, sometimes referred to as microspheres, particles, or grammatical equivalents, are small discrete particles.
Each population of bead sets is a collection of individual bead sets, each of which has a unique detectable label which allows it to be distinguished from the other bead sets within the population of bead sets. In one embodiment, the population comprises at least 5 different individual bead sets. In another embodiment, the population comprises at least 20 different individual bead sets. The population can comprise any number of bead sets as long as there is a unique detectable signal for each bead set, for example, at least 10, 20, 30, 50, 70, 100, 200, 500 or even more different individual bead sets. In a further embodiment, the population comprises at least 1000 different individual bead sets.
Any labels or signals can be used to detect the bead sets as long as they provide unique detectable signals for each bead set within the population of bead sets to be processed in a single reaction. Detectable labels include but are not limited to fluorescent labels and enzymatic labels, as well as magnetic or paramagnetic particles (see, e.g., Dynabeads® (Dynal, Oslo, Norway)). The detectable label may be on the surface of the bead or within the interior of the bead. Detectable labels for use in the invention are described in greater detail below.
The composition of the beads can vary. Suitable materials include any materials used as affinity matrices or supports for chemical and biological molecule syntheses and analyses, including but not limited to: polystyrene, polycarbonate, polypropylene, nylon, glass, dextran, chitin, sand, pumice, agarose, polysaccharides, dendrimers, buckyballs, polyacrylamide, silicon, rubber, and other materials used as supports for solid phase syntheses, affinity separations and purifications, hybridization reactions, immunoassays and other such applications.
Typically the beads have at least one dimension in the 5-10 mm range or smaller. The beads can have any shape and dimensions, but typically have at least one dimension that is 100 mm or less, for example, 50 mm or less, 10 mm or less, 1 mm or less, 100 μm or less, 50 μm or less, and typically have a size that is 10 μm or less, such as 1 μm or less, 100 nm or less, and 10 nm or less. In one embodiment, the beads have at least one dimension between 2-20 μm. Such beads are often, but not necessarily, spherical; they can be, e.g., elliptical. Such reference, however, does not constrain the geometry of the matrix, which can be any shape, including random shapes, needles, fibers, and elongated forms. Roughly spherical beads, particularly microspheres that can be used in the liquid phase, also are contemplated. The beads can include additional components, as long as the additional components do not interfere with the methods and analyses herein.
Commercially available beads which can be used in the methods of the invention include but are not limited to bead-based technologies available from LUMINEX™, Illumina, and Lynx. One embodiment provides microbeads labeled with different spectral properties and/or fluorescent (or colorimetric) intensities. For example, polystyrene microspheres are provided by LUMINEX™ Corp, Austin, Tex. that are internally dyed with two spectrally distinct fluorochromes. Using precise ratios of these fluorochromes, a large number of different fluorescent bead sets (e.g., 100 sets) can be produced. Each set of the beads can be distinguished by its spectral address, a combination of which allows for measurement of a large number of analytes in a single reaction vessel. In this embodiment, the detectable target molecule is labeled with a third fluorochrome. Because each of the different bead sets is uniquely labeled with a distinguishable spectral address, the resulting hybridized bead-target complexes will be distinguishable for each different target nucleic acid, which can be detected by passing the hybridized bead-target complexes through a rapidly flowing fluid stream. In the stream, the beads are interrogated individually as they pass two separate lasers. High speed digital signal processing classifies each of the beads based on its spectral address and quantifies the reaction on the surface. Thousands of beads can be interrogated per second, resulting in high-speed, high-throughput and accurate detection of multiple different target nucleic acids in a single reaction.
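The classification of a bead by its spectral address can be illustrated with a simple nearest-address rule. The sketch below is not a description of any vendor's signal processing: it assumes each bead set is defined by a reference pair of intensities for the two internal fluorochromes, assigns a measured bead to the nearest reference address, and rejects beads farther than a cutoff. The addresses, the cutoff, and the function name are hypothetical.

```python
import math

# Hypothetical spectral addresses: (dye 1 intensity, dye 2 intensity) per bead set.
SPECTRAL_ADDRESSES = {
    "set_01": (100.0, 100.0),
    "set_02": (100.0, 400.0),
    "set_03": (400.0, 100.0),
    "set_04": (400.0, 400.0),
}

def classify_bead(dye1, dye2, addresses, max_distance=75.0):
    """Return the bead set with the nearest address, or None if none is close enough."""
    best_set, best_dist = None, math.inf
    for name, (ref1, ref2) in addresses.items():
        dist = math.hypot(dye1 - ref1, dye2 - ref2)
        if dist < best_dist:
            best_set, best_dist = name, dist
    return best_set if best_dist <= max_distance else None

print(classify_bead(110.0, 390.0, SPECTRAL_ADDRESSES))
print(classify_bead(250.0, 250.0, SPECTRAL_ADDRESSES))
```

The cutoff keeps a bead that falls between addresses, for example an aggregate of two beads, from being assigned to a bead set.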
In addition to a detectable label, the bead sets also contain a capture probe which corresponds to an individual target nucleic acid. Typically, the capture probes are short unique DNA sequences with uniform hybridization characteristics. Useful capture probes of the invention are described in detail below.
The capture probe can be coupled to the beads using any suitable method which generates a stable linkage between the probe and the bead, and permits handling of the bead without compromising the linkage using further methods of the invention. Coupling reactions include but are not limited to the use of capture probes modified with a 5′ amine for coupling to a carboxylated microsphere or bead.
Methods to Transform a Target mRNA into a Detectable Target Molecule
In one preferred embodiment, the present invention provides methods to detect a population of target nucleic acids, where the target nucleic acids are mRNAs, as illustrated in
To detect a nucleic acid, for example, mRNAs, the invention provides methods to transform a mRNA into a corresponding detectable target molecule. However, any nucleic acid can be used, e.g., DNA, microRNA, etc. In this example, the mRNA target nucleic acid is first reverse transcribed to generate a cDNA, which is then amplified. During the amplification reaction, a detectable signal is also introduced to create a detectable target molecule, sometimes referred to as a tagged or detectable amplicon. In this process, an upstream probe and a downstream probe are first hybridized to the cDNA. The upstream probe comprises a universal upstream sequence and an upstream target-specific sequence, and the downstream probe comprises a universal downstream sequence and a downstream target-specific sequence, such that when the upstream probe and the downstream probe are both hybridized to the cDNA, the two probes are capable of being ligated, as illustrated in
The target-specific sequences of the upstream and the downstream probes comprise polynucleotide sequences that are complementary to a portion of the polynucleotide sequence of the target nucleic acid of interest. Preferably, the target-specific sequences of the present invention are completely complementary to their corresponding target sequence in the nucleic acid of interest. However, the target-specific sequences used in the present invention can have less than exact complementarity with their target sequences, as long as the upstream and downstream probes hybridized to the target sequence can be ligated by a DNA ligase.
To allow hybridization to the capture probe of the corresponding bead set, a sequence which is complementary to the capture probe must be present in the detectable target molecule. For the detection and analysis of mRNA, this sequence is sometimes referred to as the amplicon tag. The amplicon tag may be a sequence within the target nucleic acid-specific sequence, i.e. part of the upstream or downstream target specific sequences. Alternatively, either the upstream probe or the downstream probe may additionally contain an amplicon tag, which lies between the universal sequence and the target specific sequence of the probe. For example, if the amplicon tag resides within the upstream probe, then it is between the upstream universal sequence and the upstream target specific sequence.
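The layout of the probes described above can be made concrete with a short sketch. It assumes, for illustration only, that the amplicon tag resides in the upstream probe, that sequences are written 5′ to 3′, and that the downstream probe carries its target-specific sequence first so that the two target-specific sequences abut at the ligation junction. All sequences are short hypothetical placeholders, not sequences of the invention.

```python
from dataclasses import dataclass

@dataclass
class ProbePair:
    upstream: str    # universal upstream + amplicon tag + upstream target-specific
    downstream: str  # downstream target-specific + universal downstream

def build_probe_pair(universal_up, amplicon_tag, target_up, target_down, universal_down):
    """Assemble the two probes; the tag lies between universal and target-specific parts."""
    upstream = universal_up + amplicon_tag + target_up
    downstream = target_down + universal_down
    return ProbePair(upstream, downstream)

def ligated_product(pair):
    """Sequence obtained when the adjacent hybridized probes are joined."""
    return pair.upstream + pair.downstream

# Hypothetical placeholder sequences, chosen only to make the layout visible.
pair = build_probe_pair(
    universal_up="AAAAAA",
    amplicon_tag="CCCC",
    target_up="GATTAC",
    target_down="AGGCTA",
    universal_down="TTTTTT",
)
print(pair.upstream)
print(ligated_product(pair))
```

In the ligated product the universal sequences flank the tag and the target-specific region, so a single pair of universal primers can amplify every ligated probe pair in the reaction.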
Methods to Transform a microRNA into a Detectable Target Molecule
The present invention also provides methods to detect other nucleic acids, such as a population of microRNAs. The detection of microRNAs represents a significant problem in the art because of their size and sequence similarities. microRNAs are a recently identified class of small non-coding RNAs, which are typically around 21 nucleotides and may differ in sequence by only one or a few nucleotides. At present, hundreds of distinct microRNAs have been identified; however, new microRNAs continue to be described.
Mature microRNAs are excised from a stem-loop precursor that itself can be transcribed as part of a longer primary RNA, sometimes referred to as pri-microRNA. The pri-microRNA is then processed by a nuclear RNAse, cleaving the base of the stem-loop and defining one end of the microRNA. Following export to the cytoplasm, the precursor microRNA is further processed by a second RNAse which cleaves both strands of the RNA, typically about 22 nucleotides from the base of the stem. The two strands of the resulting double-stranded RNA are differentially stable, and the mature microRNA resides on the more stable strand. See Lee, EMBO J. 21:4663-70 (2002); Lee, Nature 425:415-19 (2003); Yi, Genes Dev. 17:3011-16 (2003); Lund, Science 303:95-8 (2004); Khvorova, Cell 115:209-16 (2003); and Schwarz, Cell 115:199-208 (2003).
To detect a population of microRNAs, the invention provides methods to transform a microRNA into a corresponding detectable target molecule using essentially the method previously described in Miska et al., Genome Biology 5:R68 (2004). In this method, one first ligates at least one adaptor to the population of microRNAs, generating a population of ligated adaptor-microRNA molecules. These ligated molecules are then detectably labeled, thereby generating a detectable target molecule which corresponds to the specific microRNA. In one embodiment, the adaptor-microRNA is detectably labeled by reverse transcription using the adaptor-microRNA as a template for polymerase chain reaction. At least one of the primers used in said polymerase chain reaction is detectably labeled. Detectable labels are described in detail below.
More particularly, the method involves first size selecting 18-26 nucleotide RNAs from total RNA, for example using denaturing polyacrylamide gel electrophoresis (PAGE). Oligonucleotides are then attached to the 5′ and 3′ ends of the small RNAs to generate ligated small RNAs. The ligated small RNAs are then used as templates for reverse transcription PCR, as previously described for microRNA cloning. See Lee, Science 294:862-4 (2001); Lagos-Quintana, Science 294:853-8 (2001); Lau, Science 294:858-62 (2001). The RT-PCR can include for example 10 cycles of amplification. To detectably label the resulting amplification product, either of the primers used for the RT-PCR reaction can have a detectable label, such as a fluorophore such as Cy3. Preferably, the detectable label is attached to the 5′ end of the primer.
The adaptors of the present invention are comprised of nucleic acid sequences typically not found in the population of microRNAs. Preferably, there is less than 35% identity (homology) between the adaptor sequence and the template, more preferably less than 30% identity, still more preferably less than 25% identity. The sequence analysis programs used to determine homology are run at the default setting.
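As a rough illustration of the identity criterion, the sketch below screens a candidate adaptor against a set of microRNA sequences. It is a deliberate simplification and not a substitute for the sequence analysis programs referred to above: identity is taken as the best fraction of matching positions over all ungapped offsets, divided by the length of the shorter sequence, and both sequences are assumed to be written in the same alphabet. The sequences and function names are hypothetical.

```python
def best_ungapped_identity(a, b):
    """Highest fraction of matching positions over all ungapped offsets of a against b."""
    shorter = min(len(a), len(b))
    best = 0
    for offset in range(-(len(a) - 1), len(b)):
        matches = sum(
            1
            for i, base in enumerate(a)
            if 0 <= i + offset < len(b) and base == b[i + offset]
        )
        best = max(best, matches)
    return best / shorter

def adaptor_is_acceptable(adaptor, micrornas, max_identity=0.35):
    """True if the adaptor stays below the identity cutoff against every microRNA."""
    return all(best_ungapped_identity(adaptor, m) < max_identity for m in micrornas)

adaptor = "GGGGGGGGGG"  # hypothetical candidate adaptor
print(best_ungapped_identity(adaptor, "ACACACACAC"))
print(best_ungapped_identity(adaptor, "UGAGGUAGUAGG"))
print(adaptor_is_acceptable(adaptor, ["ACACACACAC", "UGAGGUAGUAGG"]))
```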
To specifically identify individual microRNAs, the invention provides a population of bead sets where the capture probes are complementary to the microRNA sequences themselves, rather than the adaptor sequences. Thus, the invention provides in certain embodiments populations of bead sets which are specific to all known microRNAs. As microRNAs continue to be discovered, the invention allows ready addition of new bead sets corresponding to the newly discovered microRNAs. As discussed in detail below, the invention also provides specific sets of populations of bead sets for the expression profiling of signature microRNAs.
As described above, the probes, primers, and adaptors of the invention include but are not limited to the capture probes of the bead sets, universal primers for amplification of the ligation complexes for nucleic acid detection such as mRNA detection, adaptors for the detection of different nucleic acids such as microRNAs, and amplicon tags for hybridization of the detectable target molecules to the capture probes of the bead sets. The invention also provides additional primers, probes, and adaptors for use in various nucleic acid manipulations. The probes, primers and adaptors are sometimes referred to simply as primers.
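Because the capture probe is complementary to the microRNA itself, its sequence can be derived directly from the microRNA sequence. The following sketch assumes a DNA capture probe written 5′ to 3′ and computes the reverse complement of a microRNA sequence; the example sequence is a hypothetical fragment and the function name is illustrative.

```python
# Base-pairing table yielding DNA bases; accepts RNA (U) or DNA (T) input.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}

def capture_probe_for(microrna):
    """Return the DNA reverse complement (5' to 3') of a microRNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(microrna.upper()))

probe = capture_probe_for("UGAGGUAG")  # hypothetical microRNA fragment
print(probe)
```

Adding a bead set for a newly described microRNA then amounts to computing one more probe sequence and coupling it to an unused bead set.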
The probes, primers, and adaptors used in the methods of the invention can be readily prepared by the skilled artisan using a variety of techniques and procedures. For example, such probes, primers, and adaptors can be synthesized using a DNA or RNA synthesizer. In addition, probes, primers, and adaptors may be obtained from a biological source, such as through a restriction enzyme digestion of isolated DNA. Preferably, the primers are single-stranded.
As used herein, the term “primer” has the conventional meaning associated with it in standard PCR procedures, i.e., an oligonucleotide that can hybridize to a polynucleotide template and act as a point of initiation for the synthesis of a primer extension product that is complementary to the template strand.
Preferably, the primers of the present invention have exact complementarity with their target sequences. However, primers used in the present invention can have less than exact complementarity with their target sequence as long as the primer can hybridize sufficiently with the target sequence so as to function as described; for example to be extendible by a DNA polymerase or for hybridization with the capture probe of the bead set.
For use in a given multiplex reaction, the universal primer sequences are typically analyzed as a group to evaluate the potential for fortuitous dimer formation between different primers. This evaluation may be achieved using commercially available computer programs for sequence analysis, such as Gene Runner, Hastings Software Inc. Other variables, such as the preferred concentrations of Mg2+, dNTPs, polymerase, and primers, are optimized using methods well-known in the art (Edwards et al., PCR Methods and Applications 3:565 (1994)).
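The group-wise evaluation for fortuitous dimer formation can be illustrated with a very simple screen. The sketch below does not reproduce the analysis performed by any commercial program: it merely flags a pair of primers when the last few bases at the 3′ end of one primer are perfectly complementary to a stretch of the other. The tail length, the primer names, and the sequences are hypothetical.

```python
from itertools import combinations_with_replacement

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))

def can_dimerize(p1, p2, tail=5):
    """True if the 3' tail of either primer can base-pair with a stretch of the other."""
    return revcomp(p1[-tail:]) in p2 or revcomp(p2[-tail:]) in p1

def dimer_pairs(primers, tail=5):
    """List every pair of primer names (self-pairs included) flagged as a potential dimer."""
    return [
        (a, b)
        for a, b in combinations_with_replacement(primers, 2)
        if can_dimerize(primers[a], primers[b], tail)
    ]

# Hypothetical primer set, written 5' to 3'.
PRIMERS = {
    "A": "CACACACACAGGTTC",
    "B": "TGTGTGAACCTGTTT",
    "C": "TTTTTTTTTTTTTTT",
}
print(dimer_pairs(PRIMERS))
```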
Any labels or signals which allow detection of the bead set and the detectable target molecules can be used in the methods of the invention. Such detectable labels are well known in the art.
According to the invention, there is a target-specific bead set which corresponds to each target nucleic acid of interest. For each bead set there is a detectable signal, and for the corresponding target nucleic acid there is a distinct detectable signal. Thus, detection of an individual target nucleic acid of interest requires two distinguishable detectable signals.
The detectable labels of the invention may be added to the target nucleic acid and/or the bead sets using various methods. The detectable label may be covalently conjugated with the nucleic acid or non-covalently attached to the nucleic acid through sequence-specific or non-sequence-specific binding. Examples of the detectable labels include, but are not limited to, biotin, digoxigenin, fluorescent molecule (e.g., fluorescein and rhodamine), chemiluminescent moiety (e.g., LUMINOL™), coenzyme, enzyme substrate, radioisotopes, a particle such as latex or carbon particle, nucleic acid-binding protein, and polynucleotide that specifically hybridizes with either the target or reference nucleic acid strand. Detection of the presence of the label can be achieved by observation or measurement of signals emitted from the label. The production of the signal may be facilitated by binding of the label to its counter-part molecule, which triggers a reaction directly or indirectly. For example, the target nucleic acid may be labeled with biotin; upon binding of streptavidin-HRP (horse radish peroxidase) and addition of the substrate for HRP (e.g., ABTS), the presence of the biotin-labeled target molecule can be detected by observing or measuring color changes in the mixture.
In certain preferred embodiments, the labels are fluorescent and the hybridized bead-target complexes are detected using a fluorescence polarization machine, also referred to as a flow cytometer. Fluorescent dyes with diverse spectral properties (e.g., as supplied by MOLECULAR PROBES™, Eugene, Oreg.) may be used to simultaneously detect multiple detectable target molecules. In this assay, each target molecule may be labeled with a fluorescent dye having a different spectral property than that for another target molecule. In another preferred embodiment, the detectable target molecule is labeled with a biotin, and the final hybridized bead-target complexes are further reacted with a signal such as streptavidin-phycoerythrin.
In the present invention, a target nucleic acid refers to a sequence of nucleotides to be studied either for the presence of a difference from a reference sequence or for the determination of its presence or absence. The target nucleic acid sequence may be double stranded or single stranded and from a natural or synthetic source. When the target nucleic acid sequence is single stranded, a nucleic acid duplex comprising the single stranded target nucleic acid sequence may be produced by primer-extension and/or amplification.
The present invention is preferably used with at least 5 targets in a single reaction, more preferably at least 10 targets, still more preferably with at least 14 targets, even more preferably with at least 20 targets, yet more preferably with at least 30 targets, still more preferably with at least 50 targets, and even more preferably with at least 100 targets in a single reaction, although one can target any number from 5-1000 as long as a uniquely detectable signal is used. Multiplex detection as used herein refers to the simultaneous detection of multiple nucleic acid targets in a single reaction mixture.
High-throughput denotes the ability to simultaneously process and screen a large number of individual reaction mixtures such as multiplexed nucleic acid samples (e.g. in excess of 100 RNAs) in a rapid and economical manner, as well as to simultaneously screen large numbers of different target nucleic acids within a single multiplexed nucleic acid sample.
Any nucleic acid sample of interest may be used in practicing the present invention, including without limitation eukaryotic, prokaryotic and viral DNA or RNA. In a preferred embodiment, the target nucleic acids represent a sample of total RNA, including mRNA and microRNA, isolated from an individual. This RNA may be obtained from any cell source or body fluid. Non-limiting examples of cell sources available in clinical practice include blood cells, buccal cells, cervicovaginal cells, epithelial cells from urine, fetal cells, or any cells present in tissue obtained by biopsy. Body fluids include blood, urine, cerebrospinal fluid, semen and tissue exudates at the site of infection or inflammation. Nucleic acid such as RNA is extracted from the cell source or body fluid using any of the numerous methods that are standard in the art. It will be understood that the particular method used to extract the nucleic acid will depend on the nature of the source and the type of nucleic acid to be extracted.
The present method can be used with polynucleotides comprising either full-length RNA or DNA, or their fragments. The RNA or DNA can be either double-stranded or single-stranded, and can be in a purified or unpurified form. Preferably, the polynucleotides are comprised of RNA. In certain embodiments, the present invention can be used with full-size cDNA polynucleotide sequences, such as can be obtained by reverse transcription of RNA. The DNA fragments used in the present invention can be obtained by digestion of cDNA with restriction endonucleases, or by amplification of cDNA fractions from cDNA using arbitrary or sequence-specific PCR primers. The nucleic acid can be obtained from a variety of sources, including both natural and synthetic sources. The nucleic acid can be from any natural source including viruses, bacteria, yeast, plants, insects and animals.
Certain embodiments of the invention provide amplification of a nucleic acid using polymerase chain reaction (PCR). “Amplification” of DNA as used herein denotes the use of polymerase chain reaction (PCR) to increase the concentration of a particular DNA sequence within a mixture of DNA sequences. In practicing the present invention, a nucleic acid sample is contacted with pairs of oligonucleotide primers under conditions suitable for polymerase chain reaction. Conditions for performing PCR are well known in the art. Standard PCR reaction conditions may be used, e.g., 1.5 mM MgCl2, 50 mM KCl, 10 mM Tris-HCl, pH 8.3, 200 μM deoxynucleotide triphosphates (dNTPs), and 25-100 U/ml Taq polymerase (PERKIN-ELMER™, Norwalk, Conn.). The concentration of each primer in the reaction mixture can range from about 0.05 to about 4 μM. Each potential primer can be evaluated by performing single PCR reactions using each primer pair (e.g. a universal upstream primer and a universal downstream primer) individually. Similarly, each primer pair can be evaluated independently to confirm that all primer pairs to be included in a single multiplex PCR reaction generate a product of the expected size. As the number of targets in a single reaction increases, certain targets may not be amplified as efficiently as other targets. The concentration of the primers for such underrepresented targets may be increased to increase their yield. This adjustment may be made, for example, when amplifying 15 or more targets, and more preferably when amplifying 30 or more targets.
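The arithmetic behind preparing such a reaction mixture is a straightforward dilution calculation, sketched below. The final concentrations of MgCl2 and dNTPs follow the standard conditions given above and the primer concentration is one value within the stated range; the stock concentrations and the 50 microliter reaction volume are hypothetical assumptions chosen only for the example.

```python
def stock_volume(final_conc, stock_conc, total_volume):
    """Volume of stock needed to reach final_conc in total_volume (C1 * V1 = C2 * V2)."""
    return final_conc * total_volume / stock_conc

TOTAL_UL = 50.0  # hypothetical reaction volume in microliters

# (final concentration, stock concentration), both in the unit named in the key.
COMPONENTS = {
    "MgCl2 (mM)": (1.5, 25.0),
    "dNTPs (uM)": (200.0, 10000.0),
    "upstream primer (uM)": (0.5, 10.0),
    "downstream primer (uM)": (0.5, 10.0),
}

volumes = {
    name: stock_volume(final, stock, TOTAL_UL)
    for name, (final, stock) in COMPONENTS.items()
}
# Volume left over for buffer, polymerase, template, and water.
remaining_ul = TOTAL_UL - sum(volumes.values())

for name, vol in volumes.items():
    print(f"{name}: {vol:.2f} uL")
print(f"remaining: {remaining_ul:.2f} uL")
```

Raising the primer concentration for an underrepresented target changes only the final concentration entered for that primer.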
Multiplex PCR reactions are typically carried out using manual or automatic thermal cycling. Any commercially available thermal cycler may be used, such as, e.g., PERKIN-ELMER™ 9600 cycler.
A variety of DNA polymerases can be used during PCR with the present invention. Preferably, the polymerase is a thermostable DNA polymerase such as may be obtained from a variety of bacterial species, including Thermus aquaticus (Taq), Thermus thermophilus (Tth), Thermus filiformis, Thermus flavus, Thermococcus litoralis, and Pyrococcus furiosus (Pfu). Many of these polymerases may be isolated from the bacterium itself or obtained commercially. Polymerases to be used with the present invention can also be obtained from cells which express high levels of the cloned genes encoding the polymerase. Preferably, a combination of several thermostable polymerases can be used.
The PCR conditions used to amplify the targets are standard PCR conditions which are well known in the art. Typical conditions use 35-40 cycles, with each cycle comprising a denaturing step (e.g. 10 seconds at 94° C.), an annealing step (e.g. 15 sec at 68° C.), and an extension step (e.g. 1 minute at 72° C.). As the number of targets in a single reaction increases, the length of the extension time may be increased. For example, when amplifying 30 or more targets, the extension time may be three times longer than when amplifying 10-15 targets (e.g. 3 minutes instead of 1 minute).
In addition to the detection methods specific to the present invention, the reaction products can be analyzed using any of several methods that are well-known in the art, for example to confirm isolated steps of the methods. For example, agarose gel electrophoresis can be used to rapidly resolve and identify each of the amplified sequences. In a multiplex reaction, different amplified sequences are preferably of distinct sizes and thus can be resolved in a single gel. In one embodiment, the reaction mixture is treated with one or more restriction endonucleases prior to electrophoresis. Alternative methods of product analysis include without limitation dot-blot hybridization with allele-specific oligonucleotides and SSCP.
The methods of the invention can be used in any application or method in which it is desirable to measure or detect the presence of a population of target nucleic acids, such as for gene expression profiling or microRNA profiling. While several preferred applications are described in detail here, the invention is in no way limited to these embodiments. Other applications would become apparent to one skilled in the art having the benefit of this disclosure.
As described in detail below, the invention can be used in methods for gene expression profiling assays such as diagnostic and prognostic assays, for example for gene expression signatures; molecule or genetic library screening, such as screening cDNA/ORFs, shRNAs, and microRNAs; pharmacogenomics; and the classification of induced biological states.
The methods of the invention are useful for a variety of gene expression profiling applications. More particularly, the invention encompasses methods for high-throughput genetic screening. The method allows the rapid and simultaneous detection of multiple defined target nucleic acids such as mRNA or microRNA sequences in nucleic acid samples obtained from a multiplicity of individuals. It can be carried out by simultaneously amplifying many different target sequences from a large number of desired samples, such as patient nucleic acid samples, using the methods described above.
In general, as used herein, an expression signature is a set of genes, where the expression level of the individual genes differs between a first physiological state or condition relative to their expression level in a second physiological state or condition, i.e. state A and state B. For example, between cancerous cells and non-cancerous cells, or cells infected with a pathogen and uninfected cells, or cells in different states of development.
The terms “differentially expressed gene,” “differential gene expression” and their synonyms, which are used interchangeably, refer to a gene whose expression is activated to a higher or lower level in one physiological state relative to a second physiological state, for example in a subject suffering from a disease, such as cancer, relative to its expression in a normal or control subject. As used herein, “gene” specifically includes nucleic acids which do not encode proteins, such as microRNAs. The terms also include genes whose expression is activated to a higher or lower level at different states of the same disease. A differentially expressed gene may be either activated or inhibited at the nucleic acid level or protein level, or may be subject to alternative splicing to result in a different polypeptide product. Such differences may be evidenced by a change in mRNA levels or microRNA levels, surface expression, secretion or other partitioning of a polypeptide, for example. Differential gene expression may include a comparison of expression between two or more genes or their gene products, or a comparison of the ratios of the expression between two or more genes or their gene products, or even a comparison of two differently processed products of the same gene, which differ between normal subjects and subjects suffering from a disease, specifically cancer, or between various stages of the same disease. Differential expression includes both quantitative and qualitative differences in the temporal or cellular expression pattern in a gene or its expression products among, for example, normal and diseased cells, or among cells which have undergone different disease events or disease stages.
Differential gene expression is considered to be present when there is at least an about two-fold, preferably at least about four-fold, more preferably at least about six-fold, more preferably at least about ten-fold difference between the expression of a given gene between two different physiological states, such as in various stages of disease development in a diseased individual.
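As an illustration only, the fold-difference criterion described above can be sketched in Python. The function names, the default two-fold cutoff, and the requirement that expression levels be positive, linear-scale values are assumptions made for this sketch and are not taken from the disclosure.

```python
def fold_difference(level_a, level_b):
    """Return the fold difference (always >= 1) between two expression levels.

    Assumes both levels are positive, linear-scale measurements.
    """
    if level_a <= 0 or level_b <= 0:
        raise ValueError("expression levels must be positive")
    return max(level_a, level_b) / min(level_a, level_b)


def is_differentially_expressed(level_a, level_b, min_fold=2.0):
    """True when the two states differ by at least min_fold in either direction."""
    return fold_difference(level_a, level_b) >= min_fold
```

Raising the hypothetical `min_fold` argument to 4, 6, or 10 corresponds to the more stringent preferred thresholds.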
An expression signature is sometimes referred to herein as a set of marker genes. An expression signature, or set of marker genes, is a minimum number of genes that is capable of identifying a phenotypic state of a cell. A set of marker genes that is representative of a cellular phenotype is one which includes a minimum number of genes that identify markers to demonstrate that a cell has a particular phenotype. In general, two discrete cell populations in different physiological states having the desired phenotypes may be examined by the methods of the invention. The minimum number of genes in a set of marker genes will depend on the particular phenotype being examined. In some embodiments the minimum number of genes is 2 or, more preferably, 5 genes. In other embodiments, the minimum number of genes is 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 1000 genes.
One embodiment of the invention provides highly practical, i.e. low cost, high throughput, and highly flexible routine mRNA expression analysis, for example for clinical testing. The invention provides methods to analyze the expression signature for a cellular phenotype of interest by determining the expression level of a set of marker genes in a test sample. A “phenotype” as used herein refers to a physiological state of a cell under a specific set of conditions, including but not limited to malignancy, infection or a cellular disorder.
In general, analysis of an expression signature involves first determining the expression profile of a gene group, also known as the expression signature, in the test sample, and comparing the expression profile between the test sample and a corresponding control sample, where a difference in the expression profile between the test sample and the control sample is indicative of the test sample expressing the physiological state or cellular phenotype associated with the signature profile. There can be a range of differences in gene expression in the expression profile between the control sample and the profile of interest. Preferably, there are differences from the control profile in at least 25% of the genes being looked at. This can range from a 25% change to a 100% change from the control sample pattern toward the condition of interest, and all points in between, for example at least 30%, at least 40%, at least 50%, at least 75%, or at least 90%.
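The comparison of a test profile with a control profile can be sketched as follows. This is a minimal illustration, assuming that each profile is a mapping from gene name to a positive expression level and that a gene counts as changed when it differs by at least a chosen fold cutoff; the function names and both default thresholds are hypothetical.

```python
def fraction_of_genes_changed(test, control, min_fold=2.0):
    """Fraction of shared signature genes whose level differs by >= min_fold."""
    shared = sorted(test.keys() & control.keys())
    if not shared:
        raise ValueError("profiles have no genes in common")
    changed = 0
    for gene in shared:
        high = max(test[gene], control[gene])
        low = min(test[gene], control[gene])
        if low <= 0:
            raise ValueError("expression levels must be positive")
        if high / low >= min_fold:
            changed += 1
    return changed / len(shared)


def differs_from_control(test, control, min_fraction=0.25, min_fold=2.0):
    """True when at least min_fraction of the genes differ from the control."""
    return fraction_of_genes_changed(test, control, min_fold) >= min_fraction
```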
The methods of the invention can be used to analyze any expression signature for a cellular phenotype of interest. The identification of expression signatures is the subject of intense study. The invention contemplates the analysis of any expression signature of interest and is in no way limited to the specific embodiments described herein.
In one embodiment, the present invention provides methods to measure gene expression signatures in a sample, where the expression signature is indicative of a malignancy. For example, van de Vijver et al. New Engl. J. Med. 347: 1999-2009 (2002) described a 70 member expression signature which is associated with breast cancer malignancy or metastasis and is a predictor of survival. U.S. Patent Application Publication No. 2004/0018527 discloses a group of 91 genes associated with docetaxel chemosensitivity in breast cancer. Additional breast cancer expression signatures are described in detail in U.S. Patent Application Publication No. 2004/0058340 as well as Abba et al., BMC Genomics 6:37 (2005). Glas et al. (2005) described an 81 member expression signature associated with follicular lymphoma, particularly the aggressiveness of the lymphoma. Stegmaier et al. (2004) described a 5 member expression signature which was used in a cell-based small molecule screen for agents inducing the differentiation of human leukemia cells. U.S. Patent Application Publication No. 2004/0009523 discloses 14 genes associated with a diagnosis of multiple myeloma, as well as four subgroups of 24 genes associated with a prognosis of multiple myeloma. U.S. Patent Application Publication No. 2005/0089895 discloses 26 genes associated with the likelihood of recurrence in hepatocellular carcinoma. O'Donnell et al., 2005, Oncogene 24:1244-51, described a group of 116 genes associated with squamous cell carcinoma of the oral cavity. Beer et al. 2002, Nat Med 8:816-824 disclose a 50 gene risk index associated with lung adenocarcinoma survival. Classification of human lung cancer by gene expression profiling has been described in several recent publications (M. Garber, PNAS, 98(24): 13784-13789 (2001); A. Bhattacharjee, PNAS, 98(24):13790-13795 (2001)). Ramaswamy et al., 2002, Nat Gen 33:49-54 disclose 128 genes whose relative expression levels distinguish between primary and metastatic tumors.
Glinsky et al., 2005, J. Clin. Invest. 115:1503-21, discloses 11 genes associated with highly aggressive disease outcomes for several different cancers.
Other disease conditions have also been found to be associated with expression signatures. For example, U.S. Patent Application Publication No. 20040220125 discloses 40 cardioprotective genes, which are useful as a means to diagnose cardiopathology. Baechler et al. 2003, PNAS 100:2610-15 disclose a group of 161 genes associated with severe lupus; see also U.S. Patent Application Publication No. 2004/0033498.
Other cellular states for which expression signatures have been reported include apoptosis, for which a set of 35 regulator genes has been reported (Eldering et al., Nuc. Acid Res. 31:e153 (2003)), as well as inflammation, which was associated with a group of 30 genes (Id.).
The present invention also provides methods for diagnosis of infection by gene expression profiling using the methods of the invention. In one embodiment, the expression signature is comprised of cellular host genes whose expression is altered in the presence of an infectious agent. For example, U.S. Patent Application Publication No. 20040038201 discloses expression signatures of cellular host genes associated with infection with a variety of infectious agents, including E. coli, the enterohemorrhagic pathogen E. coli O157:H7, Salmonella spp., Staphylococcus aureus, Listeria monocytogenes, M. tuberculosis, and M. bovis bacillus Calmette-Guérin (BCG).
In another embodiment, the expression signature is comprised of genes of the infectious agent. The expression signature can also comprise a combination of host and infectious agent genes.
Another preferred embodiment of the invention provides methods for screening for the presence of an infection in a sample by detecting the presence of multiple genes associated with the infectious agent. Viruses, bacteria, fungi and other infectious organisms contain distinct nucleic acid sequences, which are different from the sequences contained in the host cell. Detecting or quantifying nucleic acid sequences that are specific to the infectious organism is important for diagnosing or monitoring infection. Examples of disease causing viruses that infect humans and animals and which may be detected by the disclosed processes include but are not limited to: Retroviridae (e.g., human immunodeficiency viruses, such as HIV-1 (also referred to as HTLV-III, LAV or HTLV-III/LAV, See Ratner, L. et al., Nature, Vol. 313, Pp. 227-284 (1985); Wain-Hobson, S. et al., Cell, Vol. 40: Pp. 9-17 (1985)); HIV-2 (See Guyader et al., Nature, Vol. 328, Pp. 662-669 (1987); European Patent Publication No. 0 269 520; Chakraborti et al., Nature, Vol. 328, Pp. 543-547 (1987); and European Patent Application No. 0 655 501); and other isolates, such as HIV-LP (International Publication No. WO 94/00562 entitled “A Novel Human Immunodeficiency Virus”)); Picornaviridae (e.g., polio viruses, hepatitis A virus (Gust, I. D., et al., Intervirology, Vol. 20, Pp. 1-7 (1983)), enteroviruses, human coxsackie viruses, rhinoviruses, echoviruses); Caliciviridae (e.g., strains that cause gastroenteritis); Togaviridae (e.g., equine encephalitis viruses, rubella viruses); Flaviviridae (e.g., dengue viruses, encephalitis viruses, yellow fever viruses); Coronaviridae (e.g., coronaviruses); Rhabdoviridae (e.g., vesicular stomatitis viruses, rabies viruses); Filoviridae (e.g., ebola viruses); Paramyxoviridae (e.g., parainfluenza viruses, mumps virus, measles virus, respiratory syncytial virus); Orthomyxoviridae (e.g., influenza viruses); Bunyaviridae (e.g., Hantaan viruses, bunya viruses, phleboviruses and Nairo viruses); Arenaviridae (hemorrhagic fever viruses); Reoviridae (e.g., reoviruses, orbiviruses and rotaviruses); Birnaviridae; Hepadnaviridae (Hepatitis B virus); Parvoviridae (parvoviruses); Papovaviridae (papilloma viruses, polyoma viruses); Adenoviridae (most adenoviruses); Herpesviridae (herpes simplex virus (HSV) 1 and 2, varicella zoster virus, cytomegalovirus (CMV), herpes viruses); Poxviridae (variola viruses, vaccinia viruses, pox viruses); and Iridoviridae (e.g., African swine fever virus); and unclassified viruses (e.g., the etiological agents of Spongiform encephalopathies, the agent of delta hepatitis (thought to be a defective satellite of hepatitis B virus), the agents of non-A, non-B hepatitis (class 1=internally transmitted; class 2=parenterally transmitted (i.e., Hepatitis C)); Norwalk and related viruses, and astroviruses).
Examples of infectious bacteria include but are not limited to: Helicobacter pylori, Borrelia burgdorferi, Legionella pneumophila, Mycobacteria sps (e.g. M. tuberculosis, M. avium, M. intracellulare, M. kansasii, M. gordonae), Staphylococcus aureus, Neisseria gonorrhoeae, Neisseria meningitidis, Listeria monocytogenes, Streptococcus pyogenes (Group A Streptococcus), Streptococcus agalactiae (Group B Streptococcus), Streptococcus (viridans group), Streptococcus faecalis, Streptococcus bovis, Streptococcus (anaerobic sps.), Streptococcus pneumoniae, pathogenic Campylobacter sp., Enterococcus sp., Haemophilus influenzae, Bacillus anthracis, Corynebacterium diphtheriae, Corynebacterium sp., Erysipelothrix rhusiopathiae, Clostridium perfringens, Clostridium tetani, Enterobacter aerogenes, Klebsiella pneumoniae, Pasteurella multocida, Bacteroides sp., Fusobacterium nucleatum, Streptobacillus moniliformis, Treponema pallidum, Treponema pertenue, Leptospira, and Actinomyces israelii.
Examples of parasitic protozoan infections include but are not limited to: Plasmodium vivax, Plasmodium ovale, Plasmodium malariae, Plasmodium falciparum, Toxoplasma gondii, Pneumocystis carinii, Trypanosoma cruzi, Trypanosoma brucei gambiense, Trypanosoma brucei rhodesiense, Leishmania species, including Leishmania donovani, Leishmania mexicana, Naegleria, Acanthamoeba, Trichomonas vaginalis, Cryptosporidium species, Isospora species, Balantidium coli, Giardia lamblia, Entamoeba histolytica, and Dientamoeba fragilis. See generally, Robbins et al, Pathologic Basis of Disease (Saunders, 1984) 273-75, 360-83.
microRNA Expression Profiles
We have also found that one can screen for the presence of malignant cells in a test sample by determining the level of expression of total microRNAs in a test sample; and comparing the levels of expression of microRNAs of the test sample and a control sample. A lower level of expression of microRNAs in the test sample compared to the control sample is indicative of the test sample containing malignant cells. One can use any screening method including the solution-based method described herein, or other known methods such as microarrays for microRNAs, such as that described in Miska et al., 2004.
Another embodiment of the invention provides methods of screening an individual at risk for cancer by obtaining at least two cell samples from the individual at different times; and comparing the level of expression of microRNAs in the cell samples, where a lower level of expression of microRNAs in the later obtained cell sample compared to the earlier obtained cell sample indicates that the individual is at risk for cancer.
In one preferred embodiment, the methods of the present invention are useful for characterizing poorly differentiated tumors. As exemplified herein, microRNA expression distinguishes tumors from normal tissues, even for poorly differentiated tumors. As shown in
The methods of detecting microRNAs are particularly useful for detecting tumors of histologically uncertain cellular origin, which account for 2-4% of all cancer diagnoses. In this embodiment, the expression profile of microRNAs in a tumor of uncertain cellular origin is compared to a set of microRNA expression profiles for a set of tumors of known origin, allowing classification of the test samples to be assessed based on the comparison.
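The comparison of an unknown tumor against profiles of known origin can be sketched as a nearest-profile lookup. The disclosure does not specify a similarity measure at this point; the sketch below assumes Pearson correlation over microRNA levels listed in a fixed order, and the function names and tissue labels used to exercise it are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of expression levels."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a profile with no variation cannot be correlated")
    return cov / sqrt(var_x * var_y)


def classify_by_nearest_profile(test_profile, reference_profiles):
    """Return the label of the reference profile most correlated with the test."""
    return max(
        reference_profiles,
        key=lambda label: pearson(test_profile, reference_profiles[label]),
    )
```

In practice each reference would typically be an average over many tumors of known origin, and a classifier would be validated on held-out samples; the sketch shows only the comparison step.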
In another embodiment, the level of expression for a specific group of microRNAs, sometimes referred to as a profile group of microRNAs, is determined, where lower expression of said profile group of microRNAs is associated with risk for a particular type of cancer. In particular, microRNAs can be used to classify acute lymphoblastic leukemias into the following subclassifications: t(9;22) BCR/ABL ALLs; t(12;21) TEL/AML1 ALLs; and T-cell ALLs.
We have also discovered methods for identifying an expression profile of a gene group associated with risk of a cellular disorder. Any type of nucleic acid can be examined. In certain embodiments, the genes encode mRNAs. In other preferred embodiments, the genes encode microRNAs.
In one embodiment, the methods involve the establishment of two or more sets of gene expression profiles. The gene expression profiles are utilized to develop marker gene sets which identify a phenotype. Thus, the methods of the invention involve the identification of a cell signature which is useful for identifying a phenotype of a cell.
As used herein, a control gene or set of control genes is selected that are common between the two physiological states in similar or equivalent degrees of gene expression. Additionally, a common housekeeping gene(s) may be used as an “internal” reference or control to normalize the readout for relative differences in cell populations in the screening assay. One example of a common gene useful in the invention is glyceraldehyde 3-phosphate dehydrogenase (GAPDH) (M33197). The expression level of the marker genes will define the phenotypic state when taken in ratio to the common gene(s). Hence, quantitation of the expression levels for 2 or more marker genes will be adequate to identify a new phenotypic state.
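The ratio-to-reference normalization described above can be sketched as follows. This is a minimal illustration that assumes raw readouts are positive linear-scale values and that, when several reference genes are supplied, their arithmetic mean is used as the denominator; the function name and that averaging choice are assumptions of the sketch.

```python
def normalize_to_reference(levels, reference_genes=("GAPDH",)):
    """Express each marker gene as a ratio to the mean reference-gene level.

    levels: mapping of gene name -> raw expression readout for one sample.
    Returns a mapping for the non-reference genes only.
    """
    missing = [gene for gene in reference_genes if gene not in levels]
    if missing:
        raise KeyError(f"reference gene(s) not measured: {missing}")
    reference = sum(levels[gene] for gene in reference_genes) / len(reference_genes)
    if reference <= 0:
        raise ValueError("reference level must be positive")
    return {
        gene: value / reference
        for gene, value in levels.items()
        if gene not in reference_genes
    }
```

Because every marker is divided by the same per-sample reference, differences in cell number or input material between wells cancel out of the ratios.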
In this method, one isolates cells from a group of individuals with a cancer, infection, or cellular disorder, and determines the expression level of multiple genes; isolates cells from a group of individuals without said cancer, infection, or cellular disorder, and determines the expression level of said multiple genes; identifies differential gene expression patterns that are statistically significant; and applies linear regression analysis to identify an expression profile of a gene group that is indicative of an individual having risk of said cancer, infection, or cellular disorder. One can use any screening technique to identify the expression profile. The method described herein is particularly useful because of the flexibility it provides in selecting beads that suit a specific profile.
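The two analytical steps named above, a significance test followed by linear regression, can be sketched for a single gene. The disclosure does not name a particular statistical test; the sketch below assumes an exact two-sided permutation test on the difference of group means (practical only for small groups) and an ordinary least-squares fit, and all function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(cases, controls):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = list(cases) + list(controls)
    observed = abs(mean(cases) - mean(controls))
    extreme = total = 0
    # Enumerate every way of relabeling len(cases) samples as "cases".
    for chosen in combinations(range(len(pooled)), len(cases)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen_set]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x
```

One possible use is to keep genes whose p-value falls below a chosen cutoff and then regress a disease label (coded 0 or 1) on the expression of the retained genes; with many genes a multiple-testing correction would normally be applied as well.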
The present invention also provides methods to screen a library to identify molecules that change the profile of a cell to result in a desired result. The methods of multiplex target nucleic acid detection are particularly useful in methods for drug screening, such as those disclosed in U.S. Published Patent Application No. 2004/0009495, which is hereby incorporated herein in its entirety.
In this method, the effect of a molecule, such as a small molecule, protein, etc., on the expression profile signature is used to identify molecules of interest. For example, one can screen for molecules which alter an expression signature associated with a biological state, such as cancer, such that the expression signature of a sample exposed to the small molecule is altered to more closely resemble the healthy state, i.e. a non-cancerous state. One would look for molecules that change the profile of at least 25% of the genes in the profiling to a profile of the healthy cell. In other embodiments, one looks for molecules or groups of molecules that result in a change of the expression profile of at least 30%, at least 40%, at least 50%, at least 60%, at least 75%, at least 80%, at least 90% until one gets virtual identity with the desired state.
In another embodiment, one can also screen for molecules that cause an undesired condition by looking at how an expression profile is changed from the desired profile to an undesired profile. The present methods can also be used to monitor when a patient should get therapy, what therapy and the effect of that therapy. Examples include pharmacogenomics applications and methods, including the use of gene expression signatures to predict response to therapy. Such applications can be deployed on this platform providing a practical (i.e. low cost, high throughput) mRNA expression based tool to inform treatment decisions or enrollment in clinical trials.
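Scoring compounds by how much of the signature they move to the healthy profile can be sketched as follows. This is an illustration under stated assumptions: every gene in the healthy reference is taken to be a signature gene that differs in the disease state, a gene counts as restored when its treated level lies within a chosen fold of the healthy level, and the function and compound names are hypothetical.

```python
def fraction_restored(treated, healthy, max_fold=2.0):
    """Fraction of signature genes whose treated level is within max_fold of healthy."""
    restored = 0
    for gene, target in healthy.items():
        level = treated[gene]
        if level <= 0 or target <= 0:
            raise ValueError("expression levels must be positive")
        if max(level, target) / min(level, target) < max_fold:
            restored += 1
    return restored / len(healthy)


def select_hits(treated_profiles, healthy, min_fraction=0.25, max_fold=2.0):
    """Return compound names meeting min_fraction, best-scoring first."""
    scores = {
        name: fraction_restored(profile, healthy, max_fold)
        for name, profile in treated_profiles.items()
    }
    hits = [name for name, score in scores.items() if score >= min_fraction]
    return sorted(hits, key=lambda name: -scores[name])
```

Raising the hypothetical `min_fraction` argument toward 1.0 corresponds to requiring near identity with the desired state.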
The screening methods may be used for identifying therapeutic agents or validating the efficacy of agents. Agents of either known or unknown identity can be analyzed for their effects on gene expression in cells using methods such as those described herein. Briefly, purified populations of cells are exposed to the plurality of chemical compounds, preferably in an in vitro culture high throughput setting, and optionally after set periods of time, the entire cell population or a fraction thereof is removed and mRNA is harvested therefrom. Any target nucleic acids, such as mRNAs or microRNAs, are then analyzed for expression of marker genes using methods such as those described herein. Hybridization or other expression level readouts may be then compared to the marker gene data. These methods can be used for identifying novel agents, as well as confirming the identity of agents that are suspected of playing a role in regulation of cellular phenotype.
The methods of the invention allow for subjects to be screened and potentially characterized according to their ability to respond to a plurality of drugs. For instance, cells of a subject, e.g., cancer cells, may be removed and exposed to a plurality of putative therapeutic compounds, e.g., anti-cancer drugs, in a high throughput manner. The nucleic acids of the cells may then be screened using the methods described herein to determine whether marker genes indicative of a particular phenotype are expressed in the cells. These techniques can be used to optimize therapies for a particular subject. For instance, a particular anti-cancer therapy may be more effective against a particular cancer cell from a subject. This could be determined by analyzing the genes expressed in response to the plurality of compounds. Likewise a therapeutic agent with minimal side effects may be identified by comparing the genes expressed in the different cells with a marker gene set that is indicative of a phenotype not associated with a particular side effect. Additionally, this type of analysis can be used to identify subjects for less aggressive, more aggressive, and generally more tailored therapy to treat a disorder.
The methods are also useful for determining the effect of multiple drugs or groups of drugs on a cellular phenotype. For instance it is possible to perform combined chemical genomic screens to identify a synergistic or other combined effect arising from combinations of drugs. One set of drugs may induce a first set of marker genes indicative of a phenotype, while another drug induces a second set of marker genes. When the two sets of drugs are combined they may act to achieve a collective phenotypic change, exemplified by a third set of marker genes. Additionally the methods could be used to assess complex multidrug effects on cell types. For instance, some drugs when used in combination produce a combined toxic effect. It is possible to perform the screen to identify marker genes associated with the toxic phenotype. Existing compounds could be screened for their ability to “trip” the signal signature of toxic effect, by monitoring the marker genes associated with the toxic phenotype.
The methods may also be used to enhance therapeutic strategies. For instance, oncolytic therapy involves the use of viruses to selectively lyse cancer cells. A set of marker genes which identify a gene expression signature favorable to selective viral infection can be identified. Using this set of marker genes, drugs can be found which favor or enable selective viral infectivity in order to enhance the therapeutic benefit.
Thus, the methods of the invention are useful for screening multiple compounds. For instance, the methods are useful for screening libraries of molecules, FDA approved drugs, and any other sets of compounds. Preferably the methods are used to screen at least 20 or 30 compounds, and more preferably, at least 50 compounds. In some embodiments, the methods are used to screen more than 96, 384, or 1536 compounds at a time.
In one embodiment, the methods of the invention are useful for screening FDA approved drugs. An FDA approved drug is any drug which has been approved for use in humans by the FDA for any purpose. This is a particularly useful class of compounds to screen because it represents a set of compounds which are believed to be safe and therapeutic for at least one purpose. Thus, there is a high likelihood that these drugs will at least be safe and possibly be useful for other purposes. FDA approved drugs are also readily commercially available from a variety of sources.
A “library of molecules” as used herein is a series of molecules displayed such that the compounds can be identified in a screening assay. The library may be composed of molecules having common structural features which differ in the number or type of group attached to the main structure or may be completely random. Libraries are meant to include but are not limited to, for example, phage display libraries, peptides-on-plasmids libraries, polysome libraries, aptamer libraries, synthetic peptide libraries, synthetic small molecule libraries and chemical libraries. Methods for preparing libraries of molecules are well known in the art and many libraries are commercially available. Libraries of interest include synthetic organic combinatorial libraries, such as synthetic small molecule libraries and chemical libraries. The libraries can also comprise cyclic carbon or heterocyclic structures and/or aromatic or polyaromatic structures substituted with one or more functional groups. Libraries of interest also include peptide libraries, randomized oligonucleotide libraries, and the like. Degenerate peptide libraries can be readily prepared in solution, in immobilized form as bacterial flagella peptide display libraries or as phage display libraries. Peptide ligands can be selected from combinatorial libraries of peptides containing at least one amino acid. Libraries can be synthesized of peptoids and non-peptide synthetic moieties. Such libraries can further be synthesized which contain non-peptide synthetic moieties which are less subject to enzymatic degradation compared to their naturally-occurring counterparts.
Small molecule combinatorial libraries may also be generated. A combinatorial library of small organic compounds is a collection of closely related analogs that differ from each other in one or more points of diversity and are synthesized by organic techniques using multi-step processes. Combinatorial libraries include a vast number of small organic compounds. One type of combinatorial library is prepared by means of parallel synthesis methods to produce a compound array. A “compound array” as used herein is a collection of compounds identifiable by their spatial addresses in Cartesian coordinates and arranged such that each compound has a common molecular core and one or more variable structural diversity elements. The compounds in such a compound array are produced in parallel in separate reaction vessels, with each compound identified and tracked by its spatial address. Examples of parallel synthesis mixtures and parallel synthesis methods are provided in U.S. Pat. No. 5,712,171 issued Jan. 27, 1998.
One type of library, which is known as a phage display library, includes filamentous bacteriophage which present a library of peptides or proteins on their surface. Phage display libraries can be particularly effective in identifying compounds which induce a desired effect in cells. Briefly, one prepares a phage library (using e.g. M13, fd, lambda or T7 phage), displaying inserts from 4 to about 80 amino acid residues using conventional procedures. The inserts may represent, for example, a completely degenerate or biased array. DNA sequence analysis can be conducted to identify the sequences of the expressed polypeptides. The minimal linear peptide or amino acid sequence that has the desired effect on the cells can be determined. One can repeat the procedure using a biased library containing inserts containing part or all of the minimal linear portion plus one or more additional degenerate residues upstream or downstream thereof.
For certain embodiments of this invention, e.g., where phage display libraries are employed, a preferred vector is filamentous phage, though other vectors can be used. Vectors are meant to include, e.g., phage, viruses, plasmids, cosmids, or any other suitable vector known to those skilled in the art. The vector has a gene, native or foreign, the product of which is able to tolerate insertion of a foreign peptide. By gene is meant an intact gene or fragment thereof. Filamentous phage are single-stranded DNA phage having coat proteins. Preferably, the gene that the foreign nucleic acid molecule is inserted into is a coat protein gene of the filamentous phage. Examples of coat proteins are gene III or gene VIII coat proteins. Insertion of a foreign nucleic acid molecule or DNA into a coat protein gene results in the display of a foreign peptide on the surface of the phage. Examples of filamentous phage vectors which can be used in the libraries are fUSE vectors, e.g., fUSE1, fUSE2, fUSE3 and fUSE5, in which the insertion is just downstream of the pIII signal peptide. Smith and Scott, Methods in Enzymology 217:228-257 (1993).
By recombinant vector it is meant a vector having a nucleic acid sequence which is not normally present in the vector. The foreign nucleic acid molecule or DNA is inserted into a gene present on the vector. Insertion of a foreign nucleic acid into a phage gene is meant to include insertion within the gene or immediately 5′ or 3′ to, respectively, the beginning or end of the gene, such that when expressed, a fusion gene product is made. The foreign nucleic acid molecule that is inserted includes, e.g., a synthetic nucleic acid molecule or a fragment of another nucleic acid molecule. The nucleic acid molecule encodes a displayed peptide sequence. A displayed peptide sequence is a peptide sequence that is on the surface of, e.g. a phage or virus, a cell, a spore, or an expressed gene product.
In certain embodiments, the libraries may have at least one constraint imposed upon their members. A constraint includes, e.g., a positive or negative charge, hydrophobicity, hydrophilicity, a cleavable bond and the necessary residues surrounding that bond, and combinations thereof. In certain embodiments, more than one constraint is present in each of the broader sequences of the library.
In addition to the basic libraries, the methods can also be used to screen combinations of drugs. Thus, more than one type of drug can be contacted with each cell.
In other aspects of the invention, the cells do not necessarily need to be contacted with any compounds. The cells may be analyzed for phenotypic status based on environmental condition, such as in vivo or in vitro conditions. It is possible to analyze the differentiation state or tumorigenic state of a cell using the marker gene sets or metagenes of the invention. Thus, a cell may be subjected to conditions in vitro or in vivo and then analyzed for differentiation status.
Additionally, it is possible to screen sets of compounds to identify particular dosages effective at producing a phenotypic state in a cell. For instance, one or more drugs could be contacted with the cells at a variety of dosages over a large range. When the level of marker genes expressed in each of the cells is assessed, it will be possible to identify an optimum dosage for producing a particular phenotypic state of the cell. Additionally, if some markers are associated with the production of undesirable side effects, such as production of cytotoxic factors, then an optimum drug, combination of drug or dosage of drug can be identified using the methods of the invention.
The methods of the invention are useful for assaying the effect of compounds on cells or for analyzing the phenotypic status of a cell. The methods may be used on any type of cell known in the art. For instance the cell may be a cultured cell line or a cell isolated from a subject (i.e. in vivo cell population). The cell may have any phenotypic property, status or trait. For instance, the cell may be a normal cell, a cancer cell, a genetically altered cell, etc.
Cancers include, but are not limited to, basal cell carcinoma, biliary tract cancer; bladder cancer; bone cancer; brain and CNS cancer; breast cancer; cervical cancer; choriocarcinoma; colon and rectum cancer; connective tissue cancer; cancer of the digestive system; endometrial cancer; esophageal cancer; eye cancer; cancer of the head and neck; gastric cancer; intra-epithelial neoplasm; kidney cancer; larynx cancer; leukemia; liver cancer; lung cancer (e.g., small cell and non-small cell); lymphoma including Hodgkin's and non-Hodgkin's lymphoma; melanoma; myeloma; neuroblastoma; oral cavity cancer (e.g., lip, tongue, mouth, and pharynx); ovarian cancer; pancreatic cancer; prostate cancer; retinoblastoma; rhabdomyosarcoma; rectal cancer; renal cancer; cancer of the respiratory system; sarcoma; skin cancer; stomach cancer; testicular cancer; thyroid cancer; uterine cancer; cancer of the urinary system, as well as other carcinomas and sarcomas. Some cancer cells are metastatic cancer cells.
“Normal cells” as used herein refers to any cell, including but not limited to mammalian, bacterial, and plant cells, that is a non-cancer, non-diseased, or non-genetically engineered cell. Mammalian cells include but are not limited to mesenchymal, parenchymal, neuronal, endothelial, and epithelial cells.
A “genetically altered cell” as used herein refers to a cell which has been transformed with an exogenous nucleic acid.
The present invention further concerns kits which contain, in separate packaging or compartments, the reagents such as adaptors and primers required for practicing the detection methods of the invention. Such kits typically include at least a population of detectable bead sets and preferably several different primers to generate a population of detectably labeled target molecules for detection. Such kits may optionally include the reagents required for performing ligation reactions, such as DNA or RNA ligases, PCR reactions, such as DNA polymerase, DNA polymerase cofactors, and deoxyribonucleotide-5′-triphosphates. Optionally, the kit may also include various polynucleotide molecules, restriction endonucleases, reverse transcriptases, terminal transferases, various buffers and reagents. Optimal amounts of reagents to be used in a given reaction can be readily determined by the skilled artisan having the benefit of the current disclosure.
The kits may also include reagents necessary for performing positive and negative control reactions. Preferably the kits include several target nucleic acids, in separate vials or tubes, or preferably, a set of combined standards comprising at least two different standards in the same vial or tube with a known amount of dried standard nucleic acid(s) with instructions to dilute the sample in a suitable buffer, such as PBS, to a known concentration for use in the quantification reaction. Alternatively, the standard is pre-diluted at a known concentration in a suitable buffer, such as PBS. The buffer can be suitable both for storing nucleic acids and for, e.g., PCR or direct enhancement reactions to enhance the difference between the standard and a corresponding target nucleic acid as described above; alternatively, the buffer is used only for storing the sample and a separate dilution buffer is provided which is more suitable for the subsequent PCR, enhancement and quantification reactions. In a preferred embodiment, all the standard nucleic acids are combined in one tube or vial in a buffer, so that only one standard mix needs to be added to a nucleic acid sample containing the target nucleic acid.
The kit also preferably comprises a manual explaining the reaction conditions and the measurement of the amount of target nucleic acid(s) using the standard nucleic acid(s) or a mixture of them, and giving detailed concentrations of all the standards and the type of buffer. Kits contemplated by the invention include, but are not limited to, kits for determining the amount of target nucleic acids in a biological sample, and kits for determining the amount of one or more transcripts that is expected to be increased or decreased after administration of a medicament or a drug, or as a result of a disease condition such as cancer.
The present invention also provides kits specific for the detection of particular gene expression signatures, as described above. For example, a kit may contain target-specific bead sets for detecting microRNA for use in determining microRNA expression profiles in samples, including, for example, diagnostic screening kits.
HL60 (human promyelocytic leukemia) cells were cultured in RPMI™ supplemented with 10% fetal bovine serum and antibiotics. Cells were treated with 1 μM tretinoin (all-trans-retinoic acid; SIGMA-ALDRICH™) in dimethylsulfoxide (DMSO; final concentration 0.1%) or DMSO alone for five days. Total RNA was isolated from bulk cultures with TRIzol Reagent (INVITROGEN™) in accordance with the manufacturer's directions. Cells cultured in microtiter plates were treated with 200 nM tretinoin or DMSO for two days and prepared for mRNA capture by the addition of Lysis Buffer (RNAture).
Total RNA was amplified and labeled using a modified Eberwine method, the resulting cRNA hybridized to Affymetrix GeneChip HG-U133A oligonucleotide microarrays, and the arrays scanned in accordance with the manufacturer's directions. Intensity values were scaled such that the overall fluorescence intensity of each microarray was equivalent. Expression values below an arbitrary baseline (20) were set to 20. These data are provided as Tables 5-8.
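The scaling and baseline steps described above can be sketched as follows. This is a minimal illustration only: it assumes that making the overall intensity "equivalent" means scaling each array to a common mean, and the function name, the default target value, and the toy intensities are hypothetical rather than taken from the disclosure.

```python
from statistics import mean

BASELINE = 20.0  # arbitrary baseline stated in the text


def scale_and_floor(arrays, target_mean=None, baseline=BASELINE):
    """Scale each array to a common mean intensity, then floor at the baseline.

    arrays: list of lists of raw intensities, one inner list per microarray.
    target_mean: assumed common target; defaults to the mean of the array means.
    """
    array_means = [mean(values) for values in arrays]
    if target_mean is None:
        target_mean = mean(array_means)
    scaled = []
    for values, array_mean in zip(arrays, array_means):
        factor = target_mean / array_mean
        # Values that fall below the baseline after scaling are set to it.
        scaled.append([max(v * factor, baseline) for v in values])
    return scaled


# Toy data: the second array is uniformly twice as bright as the first.
raw = [[10.0, 100.0, 190.0], [20.0, 200.0, 380.0]]
processed = scale_and_floor(raw)
```

After scaling, the two toy arrays become identical, and the lowest feature on each is raised to the baseline of 20.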
The 9466 probe-sets reporting above baseline were first divided into up- and down-regulated groups by differences in mean expression levels between tretinoin and vehicle treatments. Each of these groups was further divided into three sets of approximately equal size on the basis of the lower mean expression level. The selected basal expression categories were 20-60 (low), 60-125 (moderate) and >125 (high). Probe-sets reporting small (1.5-2.5×), medium (3-4.5×) or large (>5×) changes in mean expression level within each basal expression category were extracted and ranked by signal to noise ratio. The top five probes mapping to unique RefSeq identifiers according to NetAffx in each of the eighteen categories were selected, populating nine sets of ten genes (Table 2).
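One way to picture the eighteen categories (two directions, three basal expression levels, three magnitudes of change) is the sketch below. It is illustrative only: the handling of values that fall exactly on a boundary, the signal-to-noise formula (difference of means divided by the sum of standard deviations), and all function and probe names are assumptions, and the RefSeq uniqueness requirement is omitted.

```python
from collections import defaultdict
from statistics import mean, stdev


def basal_category(level):
    """Bin the lower of the two mean expression levels (ranges from the text)."""
    if level <= 60:
        return "low"
    if level <= 125:
        return "moderate"
    return "high"


def change_category(fold):
    """Bin the fold change; values between the stated ranges stay unassigned."""
    if 1.5 <= fold <= 2.5:
        return "small"
    if 3.0 <= fold <= 4.5:
        return "medium"
    if fold > 5.0:
        return "large"
    return None


def signal_to_noise(treated, vehicle):
    """Assumed definition: difference of means over the sum of standard deviations."""
    return (mean(treated) - mean(vehicle)) / (stdev(treated) + stdev(vehicle))


def select_probes(data, per_category=5):
    """data maps probe id -> (treated replicates, vehicle replicates)."""
    bins = defaultdict(list)
    for probe, (treated, vehicle) in data.items():
        mean_t, mean_v = mean(treated), mean(vehicle)
        lower, higher = min(mean_t, mean_v), max(mean_t, mean_v)
        change = change_category(higher / lower)
        if change is None:
            continue
        direction = "up" if mean_t > mean_v else "down"
        score = abs(signal_to_noise(treated, vehicle))
        bins[(direction, basal_category(lower), change)].append((score, probe))
    # Keep the highest-scoring probes in each of the (up to) eighteen categories.
    return {
        key: [probe for _, probe in sorted(items, reverse=True)[:per_category]]
        for key, items in bins.items()
    }


# Hypothetical probe sets with two replicates per condition.
toy = {
    "probe_a": ([90.0, 110.0], [38.0, 42.0]),
    "probe_b": ([30.0, 34.0], [180.0, 200.0]),
}
selected = select_probes(toy)
```

With five probes retained in each of eighteen categories, this scheme yields the ninety transcripts referred to in the text.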
Upstream LMA probes were composed (5′ to 3′) of the complement of the T7 primer site (TAA TAC GAC TCA CTA TAG GG) (SEQ ID NO: 876), a 24 nt barcode, and a 20 nt gene-specific sequence. Downstream LMA probes were 5′-phosphorylated and contained a 20 nt gene-specific sequence and the T3 primer site (TCC CTT TAG TGA GGG TTA AT) (SEQ ID NO: 877). Barcode sequences were developed by Tm Bioscience and detailed in the FlexMAP Microspheres Product Information Sheet (LUMINEX™). Gene-specific fragments of LMA probes were designed against the Oligator Human Genome RefSet keyed by RefSeq identifier. A 40 nt region was manually selected from within these 70 nt sequences to yield two fragments of equal length with roughly similar base composition and juxtaposing nucleotides being C-G or G-C, where possible. Probe sequences are provided as Table 3. Capture probes contained the complement of the barcode sequences and had 5′-amino modification and a C12 linker. The T7 primer (5′-TAA TAC GAC TCA CTA TAG GG-3′) (SEQ ID NO: 876) was 5′-biotinylated. The T3 primer has the sequence 5′-ATT AAC CCT CAC TAA AGG GA-3′ (SEQ ID NO: 878). Oligonucleotides (all with standard desalting) were from Integrated DNA Technologies.
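The probe layout can be illustrated with a short sketch. The primer-site sequences are copied as printed above; the 40 nt region and the 24 nt barcode are hypothetical placeholders (not actual RefSet or FlexMAP sequences), and the sketch assumes the region is written 5′ to 3′ in the orientation of the probes, so that its first half ends the upstream probe and its second half begins the downstream probe.

```python
T7_SITE = "TAATACGACTCACTATAGGG"    # as printed in the text (SEQ ID NO: 876)
T3_SITE = "TCCCTTTAGTGAGGGTTAAT"    # as printed in the text (SEQ ID NO: 877)
T3_PRIMER = "ATTAACCCTCACTAAAGGGA"  # as printed in the text (SEQ ID NO: 878)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def junction_is_gc(region_40nt):
    """True if the two nucleotides flanking the ligation junction are C-G or G-C."""
    return region_40nt[19:21] in ("CG", "GC")


def build_probe_pair(region_40nt, barcode_24nt):
    """Assemble upstream and downstream probes (5' to 3') from a 40 nt region."""
    if len(region_40nt) != 40 or len(barcode_24nt) != 24:
        raise ValueError("expected a 40 nt region and a 24 nt barcode")
    first_half, second_half = region_40nt[:20], region_40nt[20:]
    upstream = T7_SITE + barcode_24nt + first_half
    downstream = second_half + T3_SITE  # carries the 5'-phosphate in practice
    return upstream, downstream


# Hypothetical inputs for illustration only.
region = "ACGTACGTACGTACGTACGC" + "GATTACAGATTACAGATTAC"
barcode = "ATC" * 8
upstream, downstream = build_probe_pair(region, barcode)
ligated = upstream + downstream  # product expected after ligation
```

Under these assumptions the ligated product is 104 nt long regardless of the target, and the T3 primer is the reverse complement of the T3 site at its 3′ end, which is consistent with the fixed-length, universally primed amplicons described for this method.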
xMAP Multi-Analyte COOH Microspheres (LUMINEX™) were coupled to capture probes in a semi-automated microtiter plate format. Approximately 2.5×10⁶ microspheres were dispensed to the wells of a V-bottomed microtiter plate, pelleted by centrifugation at 1800 g for 3 min, and the supernatant removed. Beads were resuspended in 25 μl of binding buffer [0.1 M 2-(N-morpholino)ethanesulfonic acid, pH 4.5] by sonication and pipetting, and 100 pmol of capture probe added. Two and a half μl of a freshly prepared 10 mg/ml aqueous solution of 1-ethyl-3-[3-dimethylaminopropyl]carbodiimide hydrochloride (Pierce) was added, and the plate incubated at room temperature in the dark for 30 min. This addition and incubation step was repeated, and 180 μl 0.02% Tween-20 added with mixing. Beads were pelleted by centrifugation, as before, and washed sequentially in 180 μl 0.1% SDS and 180 μl TE (pH 8.0) with intervening spins. Coupled microspheres were resuspended in 50 μl TE (pH 8.0) and stored in the dark at 4° for up to one month. Bead mixes were freshly prepared and contained ˜1.5×10⁵/ml of each microsphere in 1.5×TMAC buffer [4.5 M tetramethylammonium chloride; 0.15% N-lauryl sarcosine; 75 mM tris-HCl, pH 8.0; 6 mM EDTA, pH 8.0]. The mapping of bead number to capture probe sequence is provided as Table 4.
Transcripts were captured in oligo-dT coated 384 well plates (GenePlateHT; RNAture) from total RNA (500 ng) in Lysis Buffer (RNAture) or whole cell lysates (20 μl). Plates were covered and centrifuged at 500 g for 1 min, and incubated at room temperature for 1 h. Unbound material was removed by inverting the plate onto an absorbent towel and spinning as before. Five μl of an M-MLV reverse transcriptase reaction mix (Promega) containing 125 μM of each dNTP (INVITROGEN™) was added. The plate was covered, spun as before, and incubated at 37° for 90 min. Wells were emptied by centrifugation, as before. Ten fmol of each probe was added in 1×Taq Ligase Buffer (NEW ENGLAND BIOLABS™) (5 μl), the plate covered, spun as before, heated at 95° for 2 min and maintained at 50° for 6 h. Unannealed probes were removed by centrifugation, as before. Five μl of 1×Taq Ligase Buffer containing 2.5 U Taq DNA ligase (NEW ENGLAND BIOLABS™) was added, the plate covered, spun as before and incubated at 45° for 1 h followed by 65° for 10 min. Wells were emptied by centrifugation, as before. Fifteen μl of a HotStarTaq DNA Polymerase mix (QIAGEN™) containing 16 μM of each dNTP (INVITROGEN™) and 100 nM of T3 primer and biotinylated-T7 primer was added. The plate was covered, spun as before, and PCR performed in a THERMO ELECTRON™ MBS 384 Satellite Thermal Cycler (initial denaturation of 92° for 9 min; 92° for 30 s, 60° for 30 s, 72° for 30 s for 39 cycles; final extension at 72° for 5 min).
Fifteen μl of LMA reaction product was mixed with 5 μl TE (pH 8.0) and 30 μl of bead mix (˜4500 of each microsphere) in the wells of a Thermowell P microtiter plate (Costar). The plate was covered and incubated at 95° for 2 min and maintained at 45° for 60 min. Twenty μl of a reporter mix containing 10 ng/μl streptavidin R-phycoerythrin conjugate (MOLECULAR PROBES™) in 1×TMAC buffer [3 M tetramethylammonium chloride; 0.1% N-lauryl sarcosine; 50 mM tris-HCl, pH 8.0; 4 mM EDTA, pH 8.0] was added with mixing and incubation continued at 45° for 5 min. Beads were analyzed with a LUMINEX™ 100 instrument. Sample volume was set at 50 μl and flow rate was 60 μl/min. A minimum of 100 events were recorded for each bead set and median fluorescence intensities (MFI) computed. Expression values for each transcript were corrected for background signal by subtracting the MFI of corresponding bead sets from blank (i.e., TE only) wells. Values below an arbitrary baseline (5) were set to 5, and all were normalized against an internal control feature (GAPDH-3′).
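The background correction, baseline, and normalization steps can be sketched as below. The text does not state the arithmetic form of the normalization; this sketch assumes it is a simple ratio to the control feature, and the function name, feature names, and MFI values are hypothetical.

```python
BASELINE = 5.0  # arbitrary baseline stated in the text


def normalize_well(sample_mfi, blank_mfi, control="GAPDH-3'", baseline=BASELINE):
    """Background-subtract, floor at the baseline, and scale to the control feature.

    sample_mfi / blank_mfi: dicts mapping feature name -> median fluorescence
    intensity for a sample well and for a blank (TE only) well.
    """
    corrected = {
        feature: max(mfi - blank_mfi[feature], baseline)
        for feature, mfi in sample_mfi.items()
    }
    reference = corrected[control]
    # Assumed normalization: ratio to the internal control feature.
    return {feature: value / reference for feature, value in corrected.items()}


# Hypothetical MFI values for one sample well and one blank well.
sample = {"GAPDH-3'": 1050.0, "GENE_A": 550.0, "GENE_B": 52.0}
blank = {"GAPDH-3'": 50.0, "GENE_A": 50.0, "GENE_B": 50.0}
normalized = normalize_well(sample, blank)
```

In this toy well, GENE_B is only 2 units above background, so it is raised to the baseline of 5 before being expressed relative to the control.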
k-Nearest-Neighbor (KNN) Classifier:
The IVT-GeneChip data from long duration high dose tretinoin or vehicle treatments were used to train a series of KNN classifiers in the spaces of the full ninety member gene set and each of the nine ten member gene categories. These were applied to the corresponding data from the eighty-eight LMA-FlexMAP test samples whose internal reference feature (GAPDH-3′) was within two standard deviations from the mean. To permit the cross-platform analysis, both the train and test data sets were normalized so that each gene had a mean of zero and a standard deviation of one. The KNN algorithm classifies a sample by assigning it the label most frequently represented among the k nearest samples. In this case k was set to 3. The votes of the nearest neighbors were weighted by one minus the cosine distance. This analysis was performed with the GenePattern software package at world wide web address broad “dot” mit “dot” edu under cancer/software/genepattern.
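The classification procedure can be sketched in plain Python as follows. This is not the GenePattern implementation: it assumes that cosine distance is also used to find the nearest neighbors, that the population standard deviation is used for standardization, and that each data set is standardized separately; all names and values are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each gene (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def cosine_distance(a, b):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train_x, train_y, sample, k=3):
    """Label a sample by the weighted vote of its k nearest training samples."""
    neighbors = sorted(
        (cosine_distance(row, sample), label) for row, label in zip(train_x, train_y)
    )[:k]
    votes = defaultdict(float)
    for distance, label in neighbors:
        votes[label] += 1.0 - distance  # weight: one minus the cosine distance
    return max(votes, key=votes.get)


# Hypothetical training data (one platform) and test data (another platform,
# on a different intensity scale); three genes per sample.
train_raw = [[200, 50, 30], [220, 60, 25], [40, 300, 90], [50, 280, 100]]
train_labels = ["tretinoin", "tretinoin", "vehicle", "vehicle"]
test_raw = [[900, 300, 150], [800, 350, 160], [300, 1000, 400], [350, 950, 380]]

train_z = zscore_columns(train_raw)
test_z = zscore_columns(test_raw)
predictions = [knn_predict(train_z, train_labels, s, k=3) for s in test_z]
```

Standardizing each gene within each data set removes platform-specific differences in scale, which is what allows a classifier trained on one platform to be applied to data from the other.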
Measurement of seventy and eighty-one transcripts has been shown to outperform established clinical and histologic parameters in disease outcome prediction for breast cancer (van de Vijver et al., 2002) and follicular lymphoma (Glas et al., 2005), respectively. Signatures of similar size and comparable prognostic power are sure to follow for a wide variety of diseases. A five member gene expression signature has also been used successfully in a cell-based small molecule screen for agents inducing the differentiation of human leukemia cells (Stegmaier et al., 2004). The absence of reliance upon prior target identification makes gene expression signature screening a powerful new strategy in drug discovery. However, immediate implementation of these and other important medical and pharmaceutical applications of genomics research is now blocked simply by the absence of a cost-effective gene expression profiling solution tailored specifically for the analysis of any feature set of up to one hundred transcripts.
High-density oligonucleotide microarrays (Lockhart et al., 1996) coupled with RNA amplification and labeling based on in vitro transcription (Van Gelder et al., 1990) provide the solution of choice for unbiased transcriptome analysis. However, the number and complexity of manipulations required, together with the cost of reagents, instrumentation, and the arrays themselves preclude its use for routine clinical and high-throughput applications. Fluorescence mediated real-time RT-PCR integrates amplification, labeling and detection (Gibson et al., 1996; Morrison et al., 1998; Tyagi and Kramer, 1996) and is ideal for quantitative assessment of individual transcripts. But the absence of a stable multiplex implementation makes this approach equally unsuitable for signature analysis. Conventional multiplex RT-PCR is simple and cheap but suffers from low amplification fidelity, not to mention the absence of a convenient way to detect, identify and quantify multiple amplicons.
Ligation-mediated amplification (LMA), in which two oligonucleotide probes are annealed immediately adjacent to each other on a complementary target DNA or RNA molecule and fused together by a DNA ligase (Landegren et al., 1988; Nilsson et al., 2000) to yield a synthetic amplification template (Hsuih et al., 1996), provides high targeting specificity and, by incorporating universal primer recognition sequences in fixed length ligation products, maintains target representation during multiplex PCR. Further, the ability to include distinct sequence addresses in one of the paired probes allows each of the resulting amplicons to be uniquely identified. Two gene expression profiling solutions based upon these principles, known as RASL (Yeakley et al., 2002) and RT-MLPA (Eldering et al., 2003), each allowing the simultaneous analysis of around fifty transcripts, have been described.
The LUMINEX™ xMAP technology platform is composed of a basic auto-injecting bench-top two laser flow cytometer and a panel of one hundred sets of carboxylated polystyrene microspheres, each set being impregnated with different proportions of two fluorophores, allowing each bead to be classified on its passage through the flow cell (world wide web address luminexcorp “dot” com). Furnishing bead sets with so-called molecular barcodes (Shoemaker et al., 1996), short unique DNA sequences with uniform hybridization characteristics, delivers an optimized universal detection solution for amplicons designed to contain complementary sequences (Iannone et al., 2000). The simplicity, flexibility, throughput and modest capital and operating costs of the LUMINEX™ system compare very favorably with the self-assembled bead fiber-optic bundle array and capillary electrophoresis detection pieces intrinsic to the RASL and RT-MLPA procedures (Eldering et al., 2003; Yeakley et al., 2002). This motivated evaluation of an integrated LMA-FlexMAP gene expression signature analysis solution (
A ninety member gene expression signature was derived from an unbiased genome-wide transcriptional analysis of a cell culture model of differentiation. Total RNA was isolated from HL60 cells following treatment with tretinoin or vehicle (DMSO) alone, amplified and labeled by in vitro transcription (IVT), and target hybridized to Affymetrix GeneChip microarrays. Features reporting above threshold were binned into three groups of equal size on the basis of expression level. Ten transcripts exhibiting low, moderate and high differential expression between the two conditions were then selected from each bin, populating a matrix of nine classes (Table 2) representing the diversity of expression characteristics.
Probe pairs incorporating unique FlexMAP barcode sequences were designed against each of the ninety transcripts (Table 3) and ten aliquots of the two original RNA samples were analyzed in this space by LMA-FlexMAP. Following subtraction of background signals, thresholding and normalization against an internal reference control feature (i.e., GAPDH), 98.5% of data points fell within two-fold of their corresponding means (
There was a poor overall correlation between the mean expression levels reported by the two platforms (correlation coefficient=0.714). LMA-FlexMAP appears to overestimate transcript levels relative to IVT-GeneChip but to a degree inversely related to absolute level (
Next, we applied our method to an idealized gene expression signature analysis problem, requiring the ability to diagnose the presence of a predefined biological state in each of a large number of samples. Data were collected for our ninety gene feature set from ninety-four microtiter well cultures of HL60 cells each treated with either tretinoin or vehicle alone. Drug concentration and treatment duration were reduced by 80% and 60%, respectively, to model the sub-maximal signatures encountered in a small molecule screen. Process time from the addition of cell lysis buffer to data delivery was sixteen hours, and overall unit cost was approximately $2. Six wells (6.4%) had internal control feature signals more than two standard deviations from the mean and were discarded. This throughput and overall drop-out rate are typical.
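The well-level quality filter can be sketched as follows. The sketch assumes the sample standard deviation is computed across all wells on the plate, including those later discarded; the function name, well names, and signal values are hypothetical.

```python
from statistics import mean, stdev


def passing_wells(control_signals, n_sd=2.0):
    """Return wells whose internal control signal lies within n_sd standard
    deviations of the mean across all wells."""
    values = list(control_signals.values())
    mu, sd = mean(values), stdev(values)
    return [well for well, v in control_signals.items() if abs(v - mu) <= n_sd * sd]


# Hypothetical control-feature signals for ten wells; the last one has failed.
signals = {
    "well_01": 1000.0, "well_02": 1010.0, "well_03": 990.0, "well_04": 1005.0,
    "well_05": 995.0, "well_06": 1000.0, "well_07": 1008.0, "well_08": 992.0,
    "well_09": 1002.0, "well_10": 100.0,
}
kept = passing_wells(signals)

# Drop-out rate reported in the text: six of ninety-four wells.
dropout_percent = round(6 / 94 * 100, 1)
```

Applied to the ninety-four wells described above, discarding six leaves the eighty-eight test samples used for classification.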
Although the feature set was designed to represent the diversity of expression characteristics rather than to contain the transcripts most highly correlated with the distinction, a k-nearest-neighbor (KNN) classifier (Cover and Hart, 1967) trained on the original high dose long duration IVT-GeneChip data delivered 100% classification accuracy for these low dose short duration samples in the full ninety gene feature space. Classifiers built in the space of each of the nine ten member gene categories had error rates between 14.8% (medium level, low differential expression) and 0% (high level, high differential expression) (Table 1). These results demonstrate both the successful deployment of our solution and the advantage of a method with higher level multiplexing capability.
Our solution underestimates changes in expression level relative to the industry-standard high-end state-of-the-art gene expression profiling platform. However, its impressive classification accuracy in an idealized application indicates that performance can easily be sacrificed for throughput in pursuit of a practical gene expression signature analysis solution, and bodes well for the rapid deployment of any legacy signature with minimal or even no optimization. The assessments reported here also suggest that new signatures designed specifically for this platform should exploit the full content capacity and avoid transcripts expressed at low or moderate levels with low degrees of differential expression. With its simplicity, flexibility, throughput and cost-effectiveness the LMA-FlexMAP method has been a transformative tool in our laboratories whose exploitation for biological discovery shall be reported elsewhere.
Details of sample information are available in Table 9. Total RNAs were prepared from tissues or cell lines using TRIzol (INVITROGEN™, Carlsbad, Calif.), as described (Ramaswamy et al., 2001), and in compliance with IRB protocols. Leukemia bone marrow mononuclear cells were collected from patients treated at ST. JUDE CHILDREN'S RESEARCH HOSPITAL™ and at DANA-FARBER CANCER INSTITUTE™ and their immunophenotype and genotype determined as previously described (Ferrando et al., 2002; Yeoh et al., 2002). Normal mouse lung and mouse lung cancer samples were collected from KRasLA1 mice, and genotyped as described (Johnson et al., 2001). Lungs from four- to five-month old mice were inflated with phosphate-buffered saline prior to removal. Individual lung tumors and normal lungs were dissected and immediately frozen on dry ice before RNA preparation. HL-60 cells were plated at 1.5×10⁵ cells/ml and induced to differentiate by 1 μM all-trans retinoic acid (SIGMA™, St. Louis, Mo.; in ethanol). Cells were harvested after 1, 3 and 5 days. Culturing conditions for other cells are detailed in Example 3.
miRNA Labelling
Target preparation from total RNA follows the described procedure (Miska et al., 2004), with modifications. Briefly, two synthetic pre-labeling-control RNA oligonucleotides (5′-pCAGUCAGUCAGUCAGUCAGUCAG-3′ (Seq ID No: 872), and 5′-pGACCUCCAUGUAAACGUACAA-3′ (Seq ID No: 873), DHARMACON™, Lafayette, Colo.) were used to control for target preparation efficiency. They were each spiked at 3 fmoles per μg total RNA. Small RNAs (18- to 26-nucleotide) were recovered from 1 to 10 μg total RNA through denaturing polyacrylamide gel purification. Small RNAs were adaptor-ligated sequentially on the 3′-end and 5′-end using T4 RNA ligase (AMERSHAM BIOSCIENCES™, Piscataway, N.J.). After reverse-transcription using adaptor-specific primer, products were PCR amplified (95° C. 40 sec, 50° C. 30 sec, 72° C. 30 sec, 18 cycles for 10 μg starting total RNA; 3′-primer: 5′-tactggaattcgcggtta-3′ (Seq ID No: 874), 5′ primer: 5′-biotin-caacggaattcctcactaaa-3′ (Seq ID No: 875), IDT, Coralville, Iowa). For side-by-side comparison of the bead-detection and the glass-microarray, a 5′-Alexa-532-modified primer was used for compatibility with the glass-microarray. PCR products were precipitated and dissolved in 66 μl TE buffer (10 mM TrisHCl, pH8.0, 1 mM EDTA) containing two biotinylated post-labeling-control oligonucleotides (100 fmoles of FVR506, and 25 fmoles PTG20210, see Table 10).
miRNA capture probes were 5′-amino-modified oligonucleotides with a 6-carbon linker (IDT). Capture probes for miRNAs and controls were divided into three sets (see Table 10), and each sample was profiled in 3 assays on these three probe sets separately. Probes were conjugated to carboxylated xMAP beads (LUMINEX™ Corporation, Austin, Tex.) in 96-well plates, following the manufacturer's protocol. For each probe set, 3 μl of every probe-bead conjugate were mixed into 1 ml of 1.5×TMAC (4.5 M tetramethylammonium chloride, 0.15% sarkosyl, 75 mM Tris-HCl, pH 8.0, 6 mM EDTA). Samples were hybridized in a 96-well plate, with two mock PCR samples (using water as template) in each plate for background control. Hybridization was carried out with 33 μl of the bead mixture and 15 μl of labelled material, at 50° C. overnight. Beads were spun down, resuspended in 1×TMAC containing 10 μg/ml streptavidin-phycoerythrin (MOLECULAR PROBES™, Eugene, Oreg.) and incubated at 50° C. for 10 minutes before data acquisition on a LUMINEX™ 100IS machine. Median fluorescence intensity values were measured.
Profiling data were first scaled according to the post-labeling-controls and then the pre-labeling-controls, in order to normalize readings from different probe/bead sets for the same sample, and to normalize for the labeling efficiency, as detailed in Materials and Methods of Example 3. Data were thresholded at 32 and log2-transformed. Hierarchical clustering was performed with average linkage and Pearson correlation. Prior to clustering, data were filtered to eliminate genes with expression lower than 7.25 (on log2 scale) in all samples. Next, all features were centered and normalized to a mean of 0 and a standard deviation of 1. k-Nearest-Neighbor classification of normal vs. tumor was performed with k=3 in the selected feature space using Euclidean distance measure. Note that different metrics were used for clustering and normal/tumor classification. Features were selected for the distinction between all normal samples vs. all tumors (for colon, kidney, prostate, uterus, lung and breast; P<0.05 after Bonferroni-correction). P values were calculated using a variance-fixed t-test with a minimal standard deviation of 0.75, after confounding the tissue types. Multi-class predictions of poorly differentiated tumors were performed using the probabilistic neural network algorithm, a Gaussian-weighted nearest neighbor method. For each test sample, the tissue type that had the highest probability in multiple one-tissue-versus-the-rest predictions was assigned. Feature number and the Gaussian width were optimized based on leave-one-out cross-validations on the training data set. Features were selected based on the variance-fixed t-test score, requiring equal number of up- and down-regulated features. Distances were based on the cosine in the selected feature space.
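The thresholding, filtering, and feature-scoring steps can be sketched as below. The exact form of the variance-fixed t-test is not given in the text; this sketch assumes a two-sample t-like statistic in which each group's standard deviation is floored at the stated minimum, and it omits the control-based scaling and the adjustment for tissue type. All function names and values are hypothetical.

```python
import math
from statistics import mean, stdev

THRESHOLD = 32.0    # raw-data threshold stated in the text
FILTER_LOG2 = 7.25  # filtering cutoff on the log2 scale
MIN_SD = 0.75       # minimal standard deviation for the t-test


def log2_threshold(values, floor=THRESHOLD):
    """Raise values below the threshold to it, then log2-transform."""
    return [math.log2(max(v, floor)) for v in values]


def keep_feature(log_values, cutoff=FILTER_LOG2):
    """A feature is eliminated only if it is below the cutoff in all samples."""
    return any(v >= cutoff for v in log_values)


def standardize(values):
    """Center a feature to mean 0 and scale it to standard deviation 1."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def variance_fixed_t(group_a, group_b, min_sd=MIN_SD):
    """Assumed form: t-like score with each group's SD floored at min_sd."""
    sd_a = max(stdev(group_a), min_sd)
    sd_b = max(stdev(group_b), min_sd)
    pooled = math.sqrt(sd_a ** 2 / len(group_a) + sd_b ** 2 / len(group_b))
    return (mean(group_a) - mean(group_b)) / pooled


# Hypothetical readings for one miRNA feature across three samples.
logged = log2_threshold([10.0, 64.0, 1024.0])
centered = standardize(logged)

# Hypothetical log2 values for one feature in normal and tumor samples.
score = variance_fixed_t([10.0, 10.2, 9.8], [8.0, 8.1, 7.9])
```

Flooring the standard deviation keeps features with very small measured variance from receiving inflated scores, which is presumably the motivation for the minimal value of 0.75.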
miRNA expression data have been submitted to GEO at world wide web address ncbi “dot” nih “dot” gov/geo, with a series accession number of GSE2564. mRNA expression data were published previously (Ramaswamy et al., 2001), and are available together with miRNA expression data at world wide web address broad “dot” mit “dot” edu under cancer/pub.
Much progress has been made over the past decade in developing a molecular taxonomy of cancer (see review Chung et al., 2002). In particular, it has become clear that among the ˜22,000 protein-coding transcripts are mRNAs capable of classifying a wide variety of human cancers (Ramaswamy et al., 2001). Recently, hundreds of small, non-coding miRNAs have been discovered (see review Bartel, 2004). The first identified miRNAs, the products of the C. elegans genes lin-4 and let-7, play important roles in controlling developmental timing and probably act by regulating mRNA translation (Ambros and Horvitz, 1984; Lee et al., 1993; Reinhart et al., 2000). When lin-4 or let-7 is inactivated, specific epithelial cells undergo additional cell divisions as opposed to their normal differentiation. Since abnormal proliferation is a hallmark of human cancers, it seemed possible that miRNA expression patterns might denote the malignant state. Furthermore, altered expression of a few miRNAs has been found in some tumor types (Calin et al., 2002; Eis et al., 2005; Johnson et al., 2005; Michael et al., 2003). However, the potential for miRNA expression to inform cancer diagnosis has not been systematically explored.
To determine the expression pattern of all known miRNAs, we first needed to develop an accurate and inexpensive profiling method. This goal is challenging, because of the miRNAs' short size (around 21 nucleotides) and the sequence similarity of members of miRNA families. Glass-slide microarrays have been used for miRNA profiling (Babak et al., 2004; Barad et al., 2004; Liu et al., 2004; Miska et al., 2004; Nelson et al., 2004; Thomson et al., 2004; Sun et al., 2004), but cross-hybridization of related miRNAs has been problematic. We therefore developed a bead-based profiling method. Oligonucleotide-capture probes complementary to miRNAs of interest were coupled to carboxylated 5-micron polystyrene beads impregnated with variable mixtures of two fluorescent dyes that yield up to 100 colors, each representing a miRNA. Following adaptor ligations utilizing both the 5′-phosphate and the 3′-hydroxyl groups of miRNAs (Miska et al., 2004), reverse-transcribed miRNAs were PCR-amplified using a common biotinylated primer, hybridized to the capture beads, and stained with streptavidin-phycoerythrin. The beads were then analyzed on a flow cytometer capable of measuring bead color (denoting miRNA identity) and phycoerythrin intensity (denoting miRNA abundance) (
Bead-based hybridization has the theoretical advantage that it may more closely approximate hybridization in solution and as such the specificity might be expected to be superior to glass microarray hybridization. Indeed, a spiking experiment involving 11 related sequences comparing bead-based detection to microarray-based detection demonstrated increased specificity of beads compared to microarrays, even for single base-pair mismatches (
We then set out to determine the expression pattern of all known miRNAs across a large panel of samples representing a diversity of human tissues and tumor types. While miRNA expression has been previously explored in small sets of tissues (Babak et al., 2004; Barad et al., 2004; Liu et al., 2004; Nelson et al., 2004; Thomson et al., 2004; Sun et al., 2004) or isolated cell types (e.g. chronic lymphocytic leukemia in Calin et al., 2001), the extent of differential expression of miRNAs across cancers has not been previously determined. Indeed, one might not have expected that miRNA expression patterns would be informative with respect to cancer diagnosis, because of the relatively small number of miRNAs encoded in the genome. Remarkably, we observed differential expression of nearly all miRNAs across cancer types (
Furthermore, the miRNAs partitioned tumors within a single lineage. For example, we examined the miRNA profiles of 73 bone marrow samples obtained from patients with acute lymphoblastic leukemia (ALL). As shown in
Among the epithelial samples, those of the gastrointestinal tract were of particular interest. Samples from colon, liver, pancreas and stomach all clustered together (
Having determined that miRNA expression distinguishes tumors of different developmental origin, we next asked whether miRNAs could be used to distinguish tumors from normal tissues. We previously reported that there exist no robust mRNA markers that are uniformly differentially expressed across tumors and normal tissues of different lineages (Ramaswamy et al., 2001). It was therefore striking to observe that despite the fact that some miRNAs are upregulated or unchanged, the majority of the miRNAs (129/217, p<0.05, after correction for multiple hypothesis testing) had lower expression in tumors compared to normal tissues, irrespective of cell type (
To exclude any possibility that the differential miRNA expression might be related to differences in collection of tumor vs. normal samples, we studied a mouse model of KRas-induced lung cancer (Johnson et al., 2001). We isolated miRNAs from normal lung or lung adenocarcinomas from individual mice, thereby precluding any differences in collection procedure. Notably, because of miRNA sequence conservation between human and mouse, the same miRNA capture beads could be used to profile the murine samples. As shown in
Our observation that miRNA expression appeared globally higher in normal tissues compared to tumors led to the hypothesis that global miRNA expression reflects the state of cellular differentiation. To test this hypothesis, we explored an experimental model in which we treated the myeloid leukemia cell line HL-60 with all-trans retinoic acid, a potent inducer of neutrophilic differentiation (Stegmaier et al., 2004). As predicted, miRNA profiling demonstrated the induction of many miRNAs coincident with differentiation (
We next turned to a more challenging diagnostic distinction: that of tumors of histologically uncertain cellular origin. It is estimated that 2%-4% of all cancer diagnoses represent cancers of unknown origin or diagnostic uncertainty (see review Pavlidis et al., 2003). To address this, we analyzed 17 poorly differentiated tumors whose histological appearance alone was non-diagnostic, but whose clinical diagnosis was established by anatomical context, either directly (e.g. a primary tumor arising in the colon) or indirectly (a metastasis of a previously identified primary). A training set of 68 more-differentiated tumors representing 11 tumor types for which both mRNA and miRNA profiles were available was used to generate a classifier. This classifier was then used without modification to classify the 17 poorly-differentiated test samples. As a group, poorly differentiated tumors had lower global levels of miRNA expression compared to the more-differentiated training set samples (
The experiments reported here demonstrate the feasibility and utility of monitoring the expression of miRNAs in human cancer. The unexpected findings are the extraordinary level of diversity of miRNA expression across cancers and the large amount of diagnostic information encoded in a relatively small number of miRNAs. The implication is that, unlike with mRNA expression, a modest number of miRNAs (˜200 in total) might be sufficient to classify human cancers. Moreover, the bead-based miRNA detection method has the attractive property of being not only accurate and specific but also being easily implementable in a routine clinical setting. In addition, unlike mRNAs, miRNAs remain largely intact in routinely collected, formalin-fixed paraffin-embedded clinical tissues (Nelson et al., 2004). More work is required to establish the clinical utility of miRNA expression in cancer diagnosis, but the work described here indicates that miRNA profiling has unexpected diagnostic potential. The mechanism by which miRNAs are under-expressed in cancer remains unknown. We did not observe substantive decreases of mRNAs encoding components of the miRNA processing machinery (Dicer, Drosha, Argonaute2, DGCR8 (Cullen, 2004), Example 3), but clearly other mechanisms of regulating miRNAs are possible.
The findings reported here are consistent with the hypothesis that in mammals, as in C. elegans, miRNAs can function to prevent cell division and drive terminal differentiation. An implication of this hypothesis is that down-regulation of some miRNAs might play a causal role in the generation or maintenance of tumors. Epithelial cells affected in C. elegans lin-4 and let-7 miRNA mutants generate a stem-cell-like lineage, dividing to produce daughters that, like themselves, divide rather than differentiate (Ambros and Horvitz, 1984; Reinhart et al., 2000). We speculate that aberrant miRNA expression might similarly contribute to the generation or maintenance of “cancer stem cells” recently proposed to be responsible for cancerous growth in both leukemias and solid tumors (Al-Hajj et al., 2003; Lapidot et al., 1994; Reya et al., 2001; Singh et al., 2004).
Additional information about the paper and a frequently-asked-questions (FAQ) page are available at ______.
HEL, TF-1, PC-3, MCF-7, HL-60, SKMEL-5, 293 and K562 cells were obtained from the AMERICAN TYPE CULTURE COLLECTION™ (ATCC™, Manassas, Va.), and cultured according to ATCC™ instructions. All T-cell ALL cell lines were cultured in RPMI™ medium supplemented with 10% fetal bovine serum. CCRF-CEM and LOUCY cells were obtained from ATCC™. ALL-SIL, HPB-ALL, PEER, TALL1, P12-ICHIKAWA cells were obtained from the German Collection of Microorganisms and Cell Cultures (DSMZ, Braunschweig, Germany). SUPT11 cells were a kind gift of Dr. Michael Cleary at Stanford University.
Umbilical cord blood was obtained under an IRB-approved protocol from the Brigham and Women's Hospital. Light-density mononuclear cells were separated by Ficoll-Hypaque centrifugation, and CD34+ cells (85-90% purity) were enriched using Midi-MACS columns (Miltenyi Biotec, Auburn, Calif.). Erythroid differentiation of the CD34+ cells was induced in two stages in liquid culture (Ebert et al., 2005). For the first seven days, cells were cultured in Serum Free Expansion Medium (SFEM, Stem Cell Technologies, Tukwila, Wash.) supplemented with penicillin/streptomycin, glutamine, 100 ng/mL stem cell factor (SCF), 10 ng/mL interleukin-3 (IL-3), 1 μM dexamethasone (SIGMA™), 40 μg/ml lipids (SIGMA™), and 3 IU/ml erythropoietin (Epo). After 7 days, cells were cultured in the same medium without dexamethasone and supplemented with 10 IU/ml Epo. For flow cytometry analyses, approximately 1 to 5×10^5 cells were labeled with a phycoerythrin-conjugated antibody against glycophorin-A (CD235a, Clone GA-R2, BD-PHARMINGEN™, San Jose, Calif.) and a FITC-conjugated antibody against CD71 (Clone M-A712, BD-PHARMINGEN™). Flow cytometry analyses were performed using a FACScan flow cytometer (BECTON DICKINSON™).
Glass-Slide Detection of miRNAs
Glass-slide microarrays were spotted oligonucleotide arrays, hybridized as described previously (Miska et al., 2004). Briefly, 5′-amino-modified oligonucleotide probes (the same ones as used on the bead platform) were printed onto amine-binding slides (CodeLink, AMERSHAM BIOSCIENCES™). Printing and hybridization were done following the slide manufacturer's protocols with the following modifications: oligonucleotide concentration for printing was 20 μM in 150 mM sodium phosphate, pH 8.5. Printing was done on a MicroGrid TAS II arrayer (BioRobotics) at 50% humidity. Labeled PCR product was resuspended in hybridization buffer (5×SSC, 0.1% SDS, 0.1 mg/ml salmon sperm DNA) and hybridized at 50° C. for 10 hours. Microarray slides were scanned using an arrayWoRxe biochip reader (APPLIED PRECISION™) and primary data were analyzed using the Digital Genome System suite (MOLECULARWARE™).
Northern blot analyses were carried out as described (Lau et al., 2001). Total RNAs from cell lines were loaded at 10 μg per lane. Blots were probed with DNA probes complementary to human miR-20, miR-181a, miR-15a, miR-16, miR-17-5p, miR-221, let-7a, and miR-21.
Reverse transcription (RT) reactions were carried out on 50 to 200 ng total RNA in 10 μl reaction volumes, using the TAQMAN™ reverse transcription kit (APPLIED BIOSYSTEMS™, Foster City, Calif.) and random hexamers, following the manufacturer's protocol. RT products were diluted 5-fold in water and assayed in triplicate using TAQMAN™ Gene Expression Assays (APPLIED BIOSYSTEMS™) on an ABI PRISM 7900HT real-time PCR machine. Efficiency of PCR amplification was determined using five two-fold serial dilutions of HL-60 cDNA. The TAQMAN™ Gene Expression Assays used were: Dicer1, Hs00998566_m1; Ago2/EIF2C2, Hs00293044_m1; Drosha/RNase3L, Hs00203008_m1; DGCR8, Hs00256062_m1; and the eukaryotic 18S rRNA endogenous control.
To eliminate bead-specific background, the reading of every bead for every sample was first processed by subtracting the average reading of that particular bead in the two embedded mock-PCR samples in each plate. As stated in the Methods, every sample was assayed in three wells. Each of the three wells contained 94 probes (19 common probes and 75 unique ones). The 19 common probes include the two pre-labeling controls and the two post-labeling controls. Quality control was performed as part of the preprocessing by requiring that the reading from each control probe exceed a minimal probe-specific threshold. These thresholds were determined by identifying a natural lower cutoff, i.e. a dip, in the distribution of each control probe. The cutoff values were chosen based on a set of samples in a pilot study. The lower post-control must exceed 500 and the higher post-control must exceed 2450. The lower and higher pre-controls must exceed 1400 and 2000, respectively (after well-to-well scaling). In this study, about 70% of the samples passed the quality control. Note that the above specifications were used on version 1 of the platform. Similar preprocessing was performed on version 2 of the platform.
Preprocessing was done in four steps: (i) well-to-well scaling, in which the readings from each well were scaled such that the total of the two post-labeling controls in that well became 4500 (a median value based on a pilot study); (ii) sample scaling, in which the normalized readings were scaled such that the total of the 6 pre-labeling controls in each sample reached 27,000 (a median value based on a pilot study); (iii) thresholding at 32 (see below); and (iv) log2 transformation. All control probes, as well as a probe (EAM296) that had a high background in the absence of any prepared target, were removed before any further analysis. After eliminating these probes, 217 features (255 for version 2 of the platform) were left, and these were used throughout the analysis.
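The four preprocessing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, the per-well array layout, and the control-probe index arguments are hypothetical, and the input readings are assumed to be already background-subtracted.

```python
import numpy as np

POST_CONTROL_TOTAL = 4500.0   # target sum of the two post-labeling controls per well
PRE_CONTROL_TOTAL = 27000.0   # target sum of the six pre-labeling controls per sample
FLOOR = 32.0                  # threshold applied before the log2 transform


def scale_well(readings, post_idx):
    """Step (i): scale one well so its post-labeling controls sum to 4500."""
    readings = np.asarray(readings, dtype=float)
    return readings * (POST_CONTROL_TOTAL / readings[post_idx].sum())


def preprocess_sample(wells, post_idx, pre_idx):
    """Apply steps (i)-(iv) to the three wells of one sample.

    wells: list of three 1-D arrays of background-subtracted readings.
    post_idx, pre_idx: positions of the control probes within a well
    (hypothetical layout: the same positions in every well).
    """
    scaled = [scale_well(w, post_idx) for w in wells]               # step (i)
    pre_total = sum(w[pre_idx].sum() for w in scaled)               # 2 controls x 3 wells
    scaled = [w * (PRE_CONTROL_TOTAL / pre_total) for w in scaled]  # step (ii)
    merged = np.concatenate(scaled)
    return np.log2(np.maximum(merged, FLOOR))                       # steps (iii), (iv)
```

In this sketch the control probes are still present in the returned vector; they would be dropped afterwards, as described above, before any further analysis.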
miRNA expression data first underwent filtering. The purpose of this filtering is to remove features that have no detectable expression and are thus uninformative but may introduce noise into the clustering. A miRNA was regarded as “not expressed” or “not detectable” if it had no expression value above a minimal cutoff in any of the samples. We applied a cutoff of 7.25 (after data were log2-transformed). This cutoff value was determined based on noise analyses of target preparation and bead detection (see below and
k-Nearest Neighbor (kNN) Prediction
After feature filtration (described in the hierarchical clustering section), marker selection was performed on 187 features. The variance-thresholded t-test score was used to score features, with a minimal standard deviation of 0.75. Markers were searched among the filtered miRNAs. A nominal P-value was calculated for each feature by permuting the class labels of the samples. In order to select features that best distinguish tumors from normal samples across all tissue types, i.e. taking into account the confounding tissue-type phenotype, restricted permutations were performed (Good, 2004). In restricted permutations, one shuffles the tumor/normal labels only within each tissue type to obtain the distribution under the desired null hypothesis. To achieve accurate estimates of the P-values, 74,800 iterations were performed (400 times the number of features, 400×187). To correct for multiple-hypothesis testing, markers were selected by requiring the Bonferroni-corrected P-values to be less than 0.05. kNN prediction was performed using the kNN module in the GenePattern software, with k=3 and a Euclidean distance measure (GenePattern at ______).
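The restricted permutation scheme can be illustrated with a short sketch. It is not the GenePattern implementation: the function names are hypothetical, flooring each class standard deviation at 0.75 is one plausible reading of the "variance-thresholded" score, and the two-sided comparison and the add-one correction of the p-value are assumptions made here.

```python
import numpy as np


def thresholded_t(x, labels, min_std=0.75):
    """Two-sample t-like score with each class standard deviation floored at min_std."""
    a, b = x[labels == 1], x[labels == 0]
    sa = max(a.std(ddof=1), min_std)
    sb = max(b.std(ddof=1), min_std)
    return (a.mean() - b.mean()) / np.sqrt(sa**2 / len(a) + sb**2 / len(b))


def restricted_permutation_p(x, labels, tissue, n_iter=1000, seed=0):
    """Nominal p-value for one feature; labels are shuffled within each tissue only."""
    rng = np.random.default_rng(seed)
    observed = abs(thresholded_t(x, labels))
    hits = 0
    for _ in range(n_iter):
        permuted = labels.copy()
        for t in np.unique(tissue):
            idx = np.flatnonzero(tissue == t)
            permuted[idx] = rng.permutation(labels[idx])  # class counts per tissue kept
        hits += abs(thresholded_t(x, permuted)) >= observed
    return (hits + 1) / (n_iter + 1)


def bonferroni(p_values):
    """Bonferroni correction: multiply by the number of tests and cap at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * len(p), 1.0)
```

Because labels move only within a tissue, a feature that merely separates tissues (rather than tumor from normal) does not receive a small p-value.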
A two-class PNN (Specht, 1990) prediction was calculated based on the following class posterior probability:
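A form consistent with the definitions that follow, assuming a Gaussian kernel of width σ applied to the squared distance (the kernel normalization constant is an assumption and cancels in the ratio), is:

$$
P(c \mid x) \;=\; \frac{P(x \mid c)\,P(c)}{\sum_{c'} P(x \mid c')\,P(c')}
\;=\; \frac{\dfrac{P(c)}{n_c} \sum_{y_i \in c} \exp\!\left(-\dfrac{D(x,y_i)^2}{2\sigma^2}\right)}
{\sum_{c'} \dfrac{P(c')}{n_{c'}} \sum_{y_i \in c'} \exp\!\left(-\dfrac{D(x,y_i)^2}{2\sigma^2}\right)}
$$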
where x is the predicted sample and c is the class for which the posterior probability is calculated. The training set samples are y_i, n_c is the number of samples of class c in the training set, and D(x,y_i) is the distance between the predicted sample and training sample i. In our case, the sum in the denominator (over c′) runs over two class values, since we predict a sample either to belong or not to belong to a specific tissue type. Note that the first step is derived using Bayes' rule, which allows a prior probability, P(c), to be incorporated for each class. We used a uniform prior over all 11 tissue types, which translated to 1/11 for being in a certain type and 10/11 for not being in that type. We did not use the tissue-type frequencies in the training set since they likely do not represent the frequencies of different tumors in the general population.
Multi-class prediction using PNN was achieved by breaking down the question into multiple one vs. the rest (OVR) predictions. To perform PNN OVR two-class classification, we built a model based on the training set. This model has two parameters: the number of features used, and σ (the standard deviation of the Gaussian kernel that is used to calculate the contribution of each training sample to the classification). The optimal parameters (for each OVR classifier) were selected using a leave-one-out cross-validation procedure from all possible parameter pairs in which the number of features ranges from 2 to 30 in steps of 2 and σ takes values from 1 to 4 times the median nearest-neighbor distance, in steps of 0.5 (15×7=105 combinations in total). The best model was determined by (i) the fewest leave-one-out errors on the training set, counting false-positive and false-negative errors with the same weight, and (ii) among all parameter pairs with the same error rate, the maximal mean log-likelihood of the training set. The mean log-likelihood is defined as
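A form consistent with the definitions that follow, assuming N training samples x_i, each scored while held out in the leave-one-out procedure, and the natural logarithm, is:

$$
\bar{L}(M) \;=\; \frac{1}{N} \sum_{i=1}^{N} \log P(c_i \mid x_i, M)
$$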
where c_i is the true class of sample x_i, and the probability is evaluated using the model M. The top n features were selected using the variance-thresholded t-test score in a balanced manner: n/2 features with the top positive scores and n/2 features with the most negative scores. The cosine distance measure was used: D(x,y_i)=1-cosine(x,y_i).
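A two-class OVR posterior of this kind, together with the parameter grid described above, can be sketched as follows. The function names are hypothetical, and the Gaussian kernel in the squared cosine distance is an assumption about the exact kernel form; the prior of 1/11 and the grid bounds come from the text.

```python
import numpy as np


def cosine_distance(x, y):
    """D(x, y) = 1 - cosine similarity of x and y."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


def pnn_posterior(x, train_X, train_in_class, sigma, prior=1.0 / 11.0):
    """Posterior probability that x belongs to the class (one vs. the rest)."""
    d = np.array([cosine_distance(x, y) for y in train_X])
    k = np.exp(-d**2 / (2.0 * sigma**2))   # Gaussian kernel contribution per sample
    in_c = np.asarray(train_in_class, dtype=bool)
    like_in = k[in_c].mean()               # (1/n_c) * sum over class members
    like_out = k[~in_c].mean()             # same quantity for "the rest"
    num = prior * like_in
    return num / (num + (1.0 - prior) * like_out)


# Parameter grid from the text: feature counts 2..30 in steps of 2, and sigma
# multipliers 1..4 in steps of 0.5 (applied to the median nearest-neighbor distance).
feature_counts = list(range(2, 31, 2))
sigma_multipliers = [1.0 + 0.5 * i for i in range(7)]
grid = [(n, s) for n in feature_counts for s in sigma_multipliers]
```

The grid has 15 feature counts and 7 σ multipliers, which reproduces the 105 combinations stated above.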
A binomial distribution was used to calculate the probability of obtaining at least the number of correct classifications (on the test set) that we observed. Assuming that a random classifier would predict the tissue type with a uniform distribution over the 11 possible outcomes, the probability of a correct classification is 1/11. This is applicable to the PNN prediction, in which the background frequency of each tissue type was assumed to be 1/11. The p-value is, therefore, the tail of the binomial distribution from the observed number of correct classifications, s, to the total number of samples in the test set, n:
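Written out with the symbols defined in the surrounding text, the binomial tail takes the standard form:

$$
\text{p-value} \;=\; \sum_{t=s}^{n} \binom{n}{t}\, p^{t}\, (1-p)^{n-t}
$$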
where p is one over the number of tissue types (1/11, in our case) and t is the number of correct classifications, which runs from the observed number, s, to the maximum possible number of correct samples, n.
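The tail probability can be computed directly from this definition. The sketch below uses only the Python standard library; the function name is hypothetical, and no observed value of s is assumed here.

```python
from math import comb


def binomial_tail(s, n, p):
    """P(at least s successes in n independent trials with success probability p)."""
    return sum(comb(n, t) * p**t * (1 - p) ** (n - t) for t in range(s, n + 1))
```

For the test set described here, the call would be `binomial_tail(s, 17, 1 / 11)`, with s set to the observed number of correct classifications.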
Development of a Bead-Based miRNA Profiling Platform
Compared with glass-based microarrays, bead-based profiling solutions have the advantages of higher sample throughput and liquid-phase hybridization kinetics, while having the disadvantage of lower feature throughput. For the genomic analysis of miRNA expression, this disadvantage is negligible because of the relatively small number of identified miRNAs. Since new miRNAs are still being discovered, the flexibility of these “liquid chips” and the ease with which new features can be introduced are of particular value.
We developed a bead-based miRNA profiling platform, as detailed in the Methods section. Version 1 of this platform (used for most samples in this study) covers 164 human, 185 mouse, and 174 rat miRNAs, according to the Rfam 5.0 miRNA registry database (Ambros et al., 2003; Griffiths-Jones, 2004). Version 2 of this platform (used for the acute lymphoblastic leukemia study and the erythroid differentiation study) covers an additional 24 human, 13 mouse, and 2 rat miRNAs (refer to Table 10 for details).
This profiling platform is compatible in theory with any miRNA labeling method that labels the sense strand. For our study, we followed the method described by Miska et al., 2004, which labels mature miRNAs through adaptor ligation, reverse transcription, and PCR amplification. We reasoned that the amplification step would allow future use of these labeled materials, which were derived from precious clinical samples. Defined amounts of synthetic artificial miRNAs were added to each sample of total RNA as pre-labeling controls. This allows us to normalize the profiling data according to the starting amount of total RNA, using readings from capture probes for these synthetic miRNAs (see Methods for details). This contrasts with the use of total feature intensity to normalize the readings of different samples; the hidden assumption of the latter is that the total miRNA expression is the same in all samples, which may not be true considering the small known number of miRNAs.
We analyzed the variation caused by labeling and detection using repeated assays of the same RNA samples from a few cell lines originating from different tissues; these cell lines have different miRNA profiles. We plotted the standard deviation of each probe versus its mean, after the data were log2-transformed (
We compared the data from expression profiles and northern blots on a panel of 7 cell lines; the same quantities of the same starting total RNAs were used for both analyses. We picked eight miRNAs that are expressed in any of these cell lines and that show differential expression according to the expression profiles, and probed them with northern blots. All eight display good concordance between the two assays (
We next examined the linearity of profiling (both labeling and detection) by measuring a series of starting materials, covering 0.5 μg to 10 μg of total RNAs from HEL cells. Most miRNAs report good linearity up to 3500 median fluorescence intensity readings (after normalization with pre-labeling-controls,
One common issue that affects hybridization-based analyses for miRNAs is the specificity of detection, since many miRNAs are closely-related on the sequence level. To assess the specificity of detection, we synthesized oligonucleotides corresponding to the reverse-transcription products of adaptor-ligated miRNAs, in this case the human let-7 family of miRNAs and a few artificial mutants. The sequences for these oligonucleotides are in Table 11, and the alignment of human let-7 miRNAs and mutant sequences are listed in Table 12. They were then labeled through PCR using the same primer sets. This provides a collection of sequence-pairs that differ by one, two, or a few nucleotides (
We applied this miRNA profiling platform for 140 human cancer specimens, 46 normal human tissues, and various cell lines. The collection of samples covers more than ten tissues and cancer types. This collection was referred to as miGCM (for miRNA Global Cancer Map). We first examined the miRNA expression profiles to see whether we can detect previously reported tissue-restricted expression of miRNAs. Indeed, we observed tissue-restricted expression patterns. For example, miR-122a, a reported liver-specific miRNA (Lagos-Quintana et al., 2002), is exclusively expressed in the liver samples, whereas miR-124a, a brain-specific miRNA (Lagos-Quintana et al., 2002), is abundantly expressed in the brain samples.
We performed hierarchical clustering on this data set, as described in the Methods. Hierarchical clustering is an unsupervised analysis tool that captures internal relationships between the samples. It organizes the samples (or features) into a tree structure (a dendrogram) according to the similarity between the samples (or the features). Close pairs of samples (ones with similar expression profiles) will generally be connected in the dendrogram at an earlier phase, while samples with larger distances (with less similar expression profiles) will be connected at a later phase (details can be found in Duda et al., 2000). The detailed result of hierarchical clustering on both the samples and features using correlation metrics is presented in
Comparison of miRNA and mRNA Clustering in Regard to GI Samples
After finding that the gastrointestinal tract samples were clustered together (Example 2 and
In order to test whether the lack of coherence of GI samples in the mRNA clustering is sensitive to the choice of genes that were used to represent each sample, we tested two additional gene filtering methods. First, we used a variation filter as was performed in Ramaswamy et al., 2001 (lower threshold of 20, upper threshold of 16000, the maximum value is at least 5 fold greater than the minimum value, and the maximum value is more than 500 greater than the minimum value), which yielded 6621 genes. Second, we examined only transcription factors, a set of gene regulators as are miRNAs. We took the genes that passed the above variation filter and that are also annotated with transcription factor activity in the Gene Ontology (GO:0003700). This resulted in 220 transcription factors as listed in the Table 13. Similar to the minimum-expression filter on the mRNA data, these two gene selection methods yielded clustering by tissue types to a certain degree. However, none recovered the gut coherence (
In order to build a classifier of normal samples vs. tumor samples based on the miGCM collection, we first picked tissues that have enough normal and tumor samples (at least 3 in each class). Table 14 summarizes the tissues for this analysis.
kNN (Duda et al., 2000) is a prediction algorithm that learns from a training data set (in this case, the above samples from the miGCM data set) and predicts samples in a test data set (in this case, the mouse lung sample set). A set of markers (features that best distinguish two classes of samples, in this case, normal vs. tumor) was selected using the training data set. Distances between the samples were measured in the space of the selected markers. Prediction is performed, one test sample at a time, by: (i) identifying the k nearest samples (neighbors) of the test sample among the training data set; and (ii) assigning the test sample to the majority class of these k samples.
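The two prediction steps can be sketched in a few lines. This is an illustration only, not the GenePattern kNN module: the function name is hypothetical, and the inputs are assumed to be already restricted to the selected marker features.

```python
from collections import Counter

import numpy as np


def knn_predict(test_x, train_X, train_labels, k=3):
    """Assign test_x to the majority class among its k nearest training samples."""
    diffs = np.asarray(train_X, dtype=float) - np.asarray(test_x, dtype=float)
    distances = np.linalg.norm(diffs, axis=1)      # Euclidean distance to each sample
    nearest = np.argsort(distances)[:k]            # step (i): k nearest neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]              # step (ii): majority class
```

With two classes and an odd k, such as k=3, the vote cannot tie.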
We first selected markers that best differentiate the normal and tumor samples (see Materials and Methods above) out of the 187 features that passed the filter (which was applied on the training set alone). This generated a list of 131 markers, each with a p-value <0.05 after Bonferroni correction; 129/131 markers are over-expressed in normal samples, whereas 2/131 are over-expressed in the tumor samples. Table 15 lists these markers.
These 131 markers were used without modification to predict the 12 mouse lung samples using the k-nearest neighbor algorithm. Each mouse sample was predicted separately, using log2 transformed mouse and human expression data. The tumor/normal phenotype prediction of a mouse sample was based on the majority type of the k nearest human samples using the chosen metric in the selected feature space. Since the tumor/normal distinction was observed at the raw miRNA expression levels, we decided to use Euclidean distance to measure the distances between samples. Thus, we performed kNN with the Euclidean distance measure and k=3, resulting in 100% accuracy. The detailed prediction results are available in Table 16. Similar classification results were obtained with other kNN parameters, with the exception of one mouse tumor T_MLUNG—5 (3rd column from right in
One hypothesis for the global decrease of miRNA expression in tumors (
We profiled the expression of miRNAs during erythroid differentiation in vitro to ask whether the increase in miRNA expression observed in the differentiation of HL-60 cells also occurs in primary cells. The accessibility of normal hematopoietic progenitor cells and the ability to recapitulate erythropoiesis in vitro provide a model to study normal differentiation. We purified CD34+ hematopoietic progenitor cells from umbilical cord blood. Erythroid differentiation was induced in vitro using a two phase liquid culture system. The state of differentiation of cultured cells was monitored every other day by evaluating expression of CD71 and glycophorin A (Gly-A) (
Analyzing Tissue Samples Using an mRNA Proliferation Signature
It is conceivable that differences in cellular proliferation, often integrally linked to differentiation, may contribute to the global miRNA signals. We asked whether the miRNA global expression differences among samples are merely a consequence of their differences in proliferation rates. To estimate the proliferation rates in tissue samples, we assembled a consensus mRNA signature of proliferation, reported to positively correlate with proliferation or mitotic index in breast tumors, lymphomas and HeLa cells (Alizadeh et al., 2000; Perou et al., 2000; Whitfield, et al., 2002). Table 18 summarizes this list.
We first asked whether the mRNA proliferation signature reflects proliferation rates in our samples. Indeed, we noticed that the mean expression of these mRNAs is higher in tumors than normal tissues (
Next, we examined the expression of the mRNA proliferation signature in the tumor samples. We focused on lung and breast, two tissues for which we have sufficient numbers of poorly differentiated and more differentiated tumors. It is important to point out that poorly differentiated tumors have globally lower miRNA expression than more differentiated tumors. However, we did not observe any difference in the mRNA proliferation signature between these two categories of samples (
RT-PCR Analyses of Genes Involved in miRNA Machinery
One possible mechanism of the observed global miRNA expression difference between normal samples and tumors is changes in expression levels of miRNA processing enzymes. In lung cancer, Dicer levels were reported to correlate with prognosis (Karube et al., 2005). We decided to examine Dicer1, Drosha, DGCR8 and Argonaute 2 (Ago2), which are critical in miRNA processing (Tomari et al., 2005). Lacking probe sets representing these genes in our mRNA data, we used quantitative RT-PCR and analyzed 79 samples (32 normal samples and 47 tumors, covering 8 tissues, including colon, breast, uterus, lung, kidney, pancreas, prostate and bladder). We normalized the quantitative PCR data with 18S rRNA levels. We performed Student's t-test (two-tail, unequal variance) for normal/tumor phenotypes on all samples examined (P=0.3 for Dicer1, P=0.11 for Drosha, P=0.0011 for DGCR8, P=0.0138 for Ago2). DGCR8 and Ago2 have significant nominal p-values under the above test. However, the fold differences of DGCR8 and Ago2 are small between tumors and normal samples (tumor samples have higher mean threshold cycle (Ct) values for these two genes; the mean Ct differences between normal and tumor samples are: 0.776 for DGCR8 and 0.798 for Ago2, corresponding to 1.7-fold and 1.5-fold absolute level differences respectively, after correction for PCR amplification efficiency). Whether or not the observed weak decreases on the transcript level may account for the differences in miRNA expression needs further investigation. It is also important to note that these results do not exclude the possibility that these miRNA machinery genes are involved in regulating tumor/normal miRNA expression in certain cancer types, or are regulated on the protein and activity levels.
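The relationship between a Ct difference and a fold difference can be made concrete with a short sketch. The function names are hypothetical, the Ct values are assumed to be already normalized to 18S rRNA, and the per-assay amplification efficiencies used in the study are not stated in the text, so the default of 2.0 (perfect doubling per cycle) is only a placeholder.

```python
import numpy as np


def efficiency_from_twofold_dilutions(ct_values):
    """Estimate per-cycle amplification efficiency from a two-fold dilution series.

    ct_values: Ct of each successive two-fold dilution, least diluted first.
    A slope of 1 cycle per two-fold dilution corresponds to an efficiency of 2.
    """
    steps = np.arange(len(ct_values))
    slope = np.polyfit(steps, ct_values, 1)[0]
    return 2.0 ** (1.0 / slope)


def fold_difference(delta_ct, efficiency=2.0):
    """Convert a Ct difference into a fold difference: efficiency ** delta_ct."""
    return efficiency ** delta_ct
```

With perfect doubling, a Ct difference of 0.776 corresponds to about a 1.7-fold difference; an efficiency below 2 gives a smaller fold difference for the same Ct difference.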
We first set out to determine whether poorly differentiated tumors show a globally weaker miRNA expression than tumor samples in the miGCM collection, which represent more differentiated states. To this end, we made a comparison of poorly differentiated tumors to more differentiated tumors of the corresponding tissue types. The analysis was performed on 180 features, after the data were filtered to eliminate non-expressing miRNAs on the 55 samples which belong to tissue types that have both more-differentiated and poorly-differentiated samples (see the hierarchical clustering section in Supplementary Methods for data filtration).
We used PNN for prediction of the tissue origin of poorly differentiated tumors. PNN is a probability-based prediction algorithm and can be considered a smooth version of kNN. For a multi-class prediction, PNN avoids the ambiguity often encountered with kNN when multiple training classes are equally represented among the k nearest neighbors of a test sample. For a two-class classification problem, PNN assigns a probability for a test sample to be classified into one of the two classes. The contribution of each training sample to the classification of a test sample depends on the distance between them and follows a Gaussian function: the closer the training sample is to the test sample, the larger its contribution. The probability for a test sample to belong to a certain class is the total contribution from every training sample belonging to that class, divided by the total contributions of all training samples (see Materials and Methods for more details).
For the prediction of poorly differentiated tumors, the training sample set consists of 68 tumor samples with both miRNA and mRNA profiling data, covering 11 tissue types. The test set contains 17 poorly differentiated tumors. Table 19 summarizes the information on the 17 poorly differentiated tumors. To solve this multi-class prediction problem, we broke down the task into 11 two-class predictions. Each two-class prediction assigns a probability for a test sample to belong to a certain tissue-type vs. the rest of the tissue-types (one vs. the rest, OVR), for example, colon vs. non-colon. After performing OVR classifications for all 11 tissues, the one tissue-type that receives the highest probability marks the predicted tissue type. The prediction results are summarized in Table 20.
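Combining the 11 OVR outputs into one call reduces to taking the tissue type with the highest probability. A minimal sketch, with a hypothetical function name and made-up probabilities used purely for illustration:

```python
def predict_tissue(ovr_probabilities):
    """Return the tissue type whose one-vs-rest probability is highest.

    ovr_probabilities: dict mapping tissue name -> OVR posterior probability.
    """
    return max(ovr_probabilities, key=ovr_probabilities.get)


# Illustrative values only; these are not results from the study.
example = {"colon": 0.62, "lung": 0.21, "breast": 0.05}
predicted = predict_tissue(example)
```

Because each OVR classifier is fitted separately, the 11 probabilities need not sum to one; only their ranking is used.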
Error rates (%) of a k-nearest-neighbor classifier trained on IVT-GeneChip data to predict the true identity (tretinoin or DMSO) of eighty-eight test samples in the space of each of the nine gene classes from
Table 10a-10b
[Tabulated values not shown: row and column labels unavailable.]
This application is a Continuation Application of U.S. Utility application Ser. No. 12/870,126, filed Aug. 27, 2010, now abandoned, which is a Continuation Application of U.S. Utility application Ser. No. 11/449,155, filed Jun. 8, 2006, now abandoned, which claims the benefit under 35 U.S.C. §119(e) of U.S. Provisional Patent Application Ser. No. 60/689,110 filed Jun. 8, 2005, the contents of which are herein incorporated by reference in their entirety.
Number | Date | Country
---|---|---
60689110 | Jun 2005 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 12870126 | Aug 2010 | US
Child | 13780189 | | US
Parent | 11449155 | Jun 2006 | US
Child | 12870126 | | US