This invention relates to the field of diagnostic development, and in particular the development of chemogenomic signatures or biomarkers. The invention provides methods for identifying a “necessary” set of information rich variables from which a plurality of “sufficient” classifiers may be derived. In the field of biological diagnostics, the invention may be used to provide short lists of genes, referred to as “gene signatures” that may be used to carry out specific classification tasks such as predicting the activity and side effects of a compound in vivo.
A diagnostic assay typically consists of performing one or more measurements and then assigning a sample to one or more categories based on the results of the measurement(s). Thus, most diagnostic devices are simply two-class classifiers. The classifier can be a function of all or of a subset of the initial variables. The value of that function is calculated for each individual datum. The individual sample is assigned to one or the other class depending on whether the result of the classifier function exceeds a defined threshold.
Desirable attributes of a diagnostic assay include high sensitivity and specificity measured in terms of low false negative and false positive rates and overall accuracy. Because diagnostic assays are often used to assign large numbers of samples to given categories, the issues of cost per assay and throughput (number of assays per unit time or per worker hour) are of paramount importance.
Usually the development of a diagnostic assay involves the following steps: (1) define the class (i.e., the end point) to diagnose, (e.g., cholestasis, a pathology of the liver); (2) identify one or more variables (i.e., measurements) whose value correlates with the end point (e.g., elevation of bilirubin in the bloodstream as an indication of cholestasis); and (3) develop a specific, accurate, high-throughput and cost-effective device for making the specific measurements needed to predict or determine the endpoint.
Over the past 10 years, a variety of techniques have been developed that are capable of measuring a large number of different biological analytes (i.e., variables) simultaneously but which require relatively little optimization for any of the individual analyte detectors. Perhaps the most successful example is the DNA microarray, which may be used to measure the expression levels of thousands or even tens of thousands of genes simultaneously. Based on well-established hybridization rules, the design of the individual probe sequences on a DNA microarray now may be carried out in silico and without any specific biological question in mind. Although DNA microarrays have been used primarily for pure research applications, this technology currently is being developed as a medical diagnostic device and everyday bioanalytical tool.
Although DNA microarrays are considerably more expensive than conventional diagnostic assays, they do offer two critical advantages. First, they tend to be more sensitive, and therefore more discriminating and accurate in prediction than most current diagnostic techniques. Using a DNA microarray, it is possible to detect a change in a particular gene's expression level earlier, or in response to a milder treatment, than is possible with more classical pathology markers. Also, it is possible to discern combinations of genes or proteins useful for resolving subtle differences in forms of an otherwise more generic pathology. Second, because of their massively parallel design, DNA microarrays make it possible to answer many different diagnostic questions. In addition, by using different combinations of variables that may be available on an array, it may be possible to confirm the answer to a single classification question in multiple independent ways and thereby increase accuracy.
A key challenge in developing the DNA microarray as a diagnostic tool lies in accurately interpreting the large amount of multivariate data provided by each measurement (i.e., each probe's hybridization). Indeed, commercially available high density DNA microarrays (also referred to as “gene chips” or “biochips”) allow one to collect thousands of gene expression measurements using standardized published protocols. However, typically only a very small fraction of these measurements are relevant to a given diagnostic classification question being asked by the user. For example, only 10-20 genes (out of 10,000 available on the microarray) may be used as the gene signature for a specific question. Thus, current DNA microarrays provide a large amount of information that is not used for answering most typical diagnostic assay questions. Similar data overload problems exist in adapting other highly multiplexed bioassays such as RT-PCR or proteomic mass spectrometry to diagnostic applications.
A recently developed powerful new application for the DNA microarray is chemogenomic analysis. The term “chemogenomics” refers to the transcriptional and/or bioassay response of one or more genes upon exposure to a particular chemical compound. A comprehensive database of chemogenomic annotations for large numbers of genes in response to large numbers of chemical compounds may be used to design and optimize new pharmaceutical lead compounds based only on a transcriptional and biomolecular profile of the known (or merely hypothetical) compound. For example, a small number of rats may be treated with a novel lead compound and then expression profiles measured for different tissues from the compound treated animals using DNA microarrays. Based on the correlative analysis of this compound treatment expression level data with respect to the chemogenomic reference database, it may be possible to predict the toxicological profile and/or likely off-target effects of the new compound. Construction of a comprehensive chemogenomic database and methods for chemogenomic analysis using microarrays are described in Published U.S. patent application No. 2005/0060102 A1, which is hereby incorporated herein by reference in its entirety.
Systematic “mining” of large chemogenomic datasets has led to the discovery of new relationships between genes. It has also led to new insight into the genes and pathways affected by particular classes of compound treatments. Important tools for discovering these new relationships are specific, short weighted lists of genes that may be used to determine whether certain gene expression changes are related (i.e., whether the observed effects are in the same class). These gene lists, referred to as “gene signatures,” provide simple, robust tools for answering classification questions using DNA microarrays. Methods for deriving and using gene signatures to analyze chemogenomic data are disclosed in Published U.S. patent application No. 2005/0060102 A1 and PCT Publication No. WO 2004/037200, each of which is hereby incorporated herein by reference in its entirety.
The use of gene signatures to answer diagnostic questions is not limited to the DNA hybridization assay context. The general concept of signatures may be widely applied to any analytical testing situation that may be reduced to a question of whether data are within or outside a specific class.
Even with robust gene signatures, however, sometimes data are measured that defy simple classification algorithms. That is, the signature does not clearly place the data in either of the two classes it defines. This may be due to the nature of the data originally used to derive the signature (i.e., the signature is not robust enough) or it may indicate that the data defines a new class. New methods are needed to derive signatures capable of classifying this type of “borderline” data. The availability of improved signatures would greatly increase the usefulness of these signatures as accurate and reliable tools for diagnostic classification.
In one embodiment, the present invention provides a method of selecting a set of necessary variables useful for answering a classification question comprising: (a) providing a full multivariate dataset; (b) querying the full dataset with a classification question so as to generate a first linear classifier comprising a first set of variables and capable of performing with a log odds ratio greater than or equal to a selected threshold value (e.g., log odds ratio greater than or equal to 4.0); (c) removing the first set of variables from the full dataset thereby generating a partially depleted dataset; (d) querying the partially depleted dataset with the classification question so as to generate a second linear classifier comprising a second set of variables; (e) repeating steps (c) and (d) until the linear classifier generated is not capable of performing with a log odds ratio greater than or equal to the selected threshold (or a second, different threshold); and (f) selecting the variables of the linear classifiers meeting the performance threshold; wherein the remaining fully depleted subset of variables is unable to answer the classification question with a log odds ratio greater than the selected threshold. In one preferred embodiment, a single log odds ratio threshold of greater than or equal to 4.0 is used. In an alternative embodiment of the method, a second threshold may be selected and used to determine the performance of the remaining variables when repeating steps (c) and (d). In one embodiment, the method may be carried out wherein the multivariate dataset comprises chemogenomics data, and specifically, comprises a dataset from polynucleotide array experiments on compound-treated samples. In another preferred embodiment of the above method, the linear classifiers are sparse, that is, they are composed of short gene lists. In a preferred embodiment, the sparse linear classifiers are generated with an algorithm selected from the group consisting of SPLP, SPLR and SPMPM.
In another embodiment the above method is carried out with a multivariate dataset comprising data from a proteomic or metabolomic experiment.
The present invention also includes a set of necessary variables for answering classification questions made according to the method described above. Necessary sets of the invention may be quite large and include all or nearly all variables in the full set of variables. In preferred embodiments, the variables in the necessary sets of the invention are genes and number fewer than 400, 300, 200, 100, or 50 genes. In one preferred embodiment, the necessary sets of variables of the present invention number fewer than 4%, 3%, 2%, 1% or 0.5% of the total number of genes present on a typical DNA microarray that includes on the order of 8,000, 10,000 or even 20,000 or more genes.
The present invention also includes an array, or other diagnostic device, comprising a set of polynucleotides each representing a gene in the necessary set made according to the method described above.
In another embodiment, the invention includes a diagnostic reagent set useful in diagnostic assays and diagnostic kits for a specific classification question comprising a set of polynucleotides each representing a gene in the necessary set made according to the above method.
In another embodiment, the invention includes a subset of genes useful for answering a chemogenomic classification question (including those questions disclosed in Table 2) comprising a percentage of genes randomly selected from the necessary set made according to the method described above, wherein the addition of the percentage of genes to the depleted set for the classification question increases the average log odds ratio of the linear classifiers generated by the depleted set. In some embodiments, the subset may be defined according to the percentage increase in the average LOR performance of the depleted set; in other embodiments, the increase corresponds to a set average LOR threshold.
In one specific embodiment, the subset of genes is useful for answering the monoamine re-uptake (SERT) inhibitor classification question and the necessary set consists of the 311 genes listed in Table 5. In one preferred embodiment, the subset comprises a randomly selected 15% of genes from the 311 in the SERT necessary set and the average log odds ratio is increased to greater than or equal to 3.0. In another preferred embodiment, the subset comprises a randomly selected 26% of genes from the 311 in the SERT necessary set and the average log odds ratio is increased to greater than or equal to 4.0.
In another embodiment, the invention includes a diagnostic assay comprising a set of secreted proteins encoded by the genes of a necessary set made according to the above-described method (e.g., an array of immobilized receptors), or an assay comprising reagents capable of detecting secreted proteins encoded by the genes of a necessary set.
In another embodiment, the invention provides a method for preparing a reagent set comprising the steps of: (a) deriving a first linear classifier comprising a first set of genes from a full dataset, wherein said first linear classifier is capable of answering a classification question with a log odds ratio greater than or equal to a first selected threshold value; (b) removing said first set of genes from the full dataset thereby resulting in a partially depleted chemogenomic dataset; (c) deriving a second linear classifier comprising a second set of genes from the partially depleted dataset, wherein the second linear classifier is capable of answering a classification question with a log odds ratio greater than or equal to a second selected threshold value; (d) removing said second set of genes from the partially depleted dataset; and (e) preparing a plurality of isolated polynucleotides or polypeptides, wherein each polynucleotide or polypeptide is capable of detecting at least one gene of said first and second sets of genes. This method of preparing a reagent set may further include the steps of: after step (d) repeating the steps of (i) deriving a linear classifier; and (ii) removing each additional linear classifier's set of genes from the partially depleted dataset; until the partially depleted dataset is not capable of generating a linear classifier with a log odds ratio greater than or equal to the second selected threshold value.
In another embodiment, the invention provides a reagent set for analysis of a chemogenomic classification question comprising a set of polynucleotides or polypeptides representing a plurality of genes, wherein a random selection of at least 10% of said plurality of genes restores the ability of a depleted set to generate signatures for the classification question with an average LOR greater than or equal to 4.0, wherein the depleted set cannot generate a signature with an average LOR of greater than 1.2. In other embodiments, the reagent set represents a plurality of genes, wherein the random selection capable of restoring the ability of the depleted set is of at least 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75% or 80% of said plurality of genes. In other embodiments, the reagent set represents a plurality of genes, wherein a random selection of at least 10% of said plurality of genes restores the ability of a depleted set to generate signatures for the classification question with an average LOR greater than or equal to 3.0, 4.0, 5.0, 6.0, 7.0, or 8.0. In one embodiment, the reagent set comprises polypeptides capable of detecting secreted proteins encoded by said genes.
In another embodiment, the invention provides a set of necessary variables for answering a classification question comprising the variables whose removal from a full multivariate dataset results in a depleted set of variables that are unable to answer the classification question with a performance greater than some selected threshold (e.g., log odds ratio greater than or equal to 4.0). In preferred embodiments, the variables in the necessary sets of the invention are genes and number fewer than 400, 300, 200, 100, 50 or even 25 genes. In one preferred embodiment, the necessary sets of variables of the present invention are genes and number fewer than 4%, 3%, 2%, 1% or 0.5% of the total number of genes present in a complete set of 8,000, 10,000 or even 20,000 or more genes.
In another embodiment, the invention includes a diagnostic device (e.g., an array), a diagnostic reagent set, or a diagnostic kit, useful for answering a classification question, comprising a set of polynucleotides representing a plurality of genes, wherein removal of the plurality of genes from a full DNA array dataset results in a depleted set of genes that is unable to generate signatures for the classification question with an average log odds ratio greater than or equal to a chosen threshold. In other embodiments, the chosen threshold is an average LOR greater than or equal to 3.0, 4.0, 5.0, 6.0, 7.0, or 8.0.
In an alternative embodiment, the invention provides a diagnostic device comprising a set of secreted proteins encoded by the genes in the necessary set for a specific classification question or a set of reagents capable of detecting said secreted proteins.
In one embodiment, the present invention provides a method of identifying non-overlapping sufficient sets of variables useful for answering a classification question comprising: providing a full multivariate dataset; querying the full dataset with a classification question so as to generate a first linear classifier capable of performing with a log odds ratio greater than or equal to a chosen threshold and comprising a first set of variables; removing the first set of variables from the full dataset thereby generating a partially depleted dataset; and querying the partially depleted dataset with the classification question so as to generate a second linear classifier capable of performing with a log odds ratio greater than or equal to a chosen threshold and comprising a second set of variables; wherein none of the variables in the second set overlaps the variables in the first set.
In one embodiment, the method of identifying non-overlapping sufficient sets may be carried out wherein the multivariate dataset comprises chemogenomics data, and specifically, comprises a dataset from polynucleotide array experiments on compound-treated samples. In another preferred embodiment of the above method, the linear classifiers are reducible to weighted gene lists. In another embodiment the above method is carried out with a multivariate dataset comprising data from a proteomic experiment.
The present invention also provides a method of classifying experimental data comprising: providing at least two non-overlapping sufficient sets of variables useful for answering a classification question; querying the experimental data with one of the at least two non-overlapping sufficient sets of variables; querying the experimental data with another of the at least two non-overlapping sufficient sets of variables; wherein the classification of the data is determined based on the answers to the queries generated by the at least two non-overlapping sets of variables.
FIGS. 2(A) and (B) depict results of repeatedly applying the stripping algorithm for four different classification questions used to query a chemogenomic dataset. Four signatures were chosen. One of them, used here as a control (NSAID, Cox2/1, coxib-like), failed at the 2nd cycle in the previous analysis (Classification #39 in Table 3). (A) shows the evolution of the Test Log Odds Ratio as a function of the cycles of stripping. (B) shows the cumulative number of genes used.
The present invention provides a method of defining a “necessary” set of variables from which multiple independent classifiers (e.g., gene signatures) may be derived. Using multiple independent signatures for the same classification question in a single classification experiment (e.g., in a single microarray assay) it is possible to analyze “borderline” data more accurately. For example, two non-overlapping gene signatures that classify a specific type of pathway inhibitors may be used to reach a consensus classification for a particular compound that does not score highly with either signature alone.
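A consensus call from two non-overlapping signatures can be sketched as follows. This is a minimal illustration only: the gene names, weights, bias values and sample logratios are invented, and the consensus rule shown (a sample is called in-class only when every signature gives a positive score) is an assumption, since no particular rule is prescribed here.

```python
def signature_score(logratios, weights, bias):
    """Sum of weight * logratio over the signature's genes, less the bias."""
    return sum(w * logratios[g] for g, w in weights.items()) - bias


def consensus_call(logratios, signatures):
    """Assumed rule: in class only if every signature scores positive."""
    scores = [signature_score(logratios, s["weights"], s["bias"]) for s in signatures]
    return all(score > 0 for score in scores), scores


# Two hypothetical signatures that share no genes.
signature_a = {"weights": {"g1": 1.0, "g2": -0.5}, "bias": 0.2}
signature_b = {"weights": {"g3": 0.8, "g4": 0.6}, "bias": 0.1}

# Hypothetical borderline sample: each score is positive but small.
sample = {"g1": 0.3, "g2": -0.1, "g3": 0.2, "g4": 0.1}
in_class, scores = consensus_call(sample, [signature_a, signature_b])
print(in_class, scores)
```

Because the two signatures use different genes, agreement between them is an independent confirmation rather than a repeat of the same measurement.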
In addition, the necessary set itself, which may be derived for any classification question according to the methods disclosed herein, represents a source of information rich variables that may be used to prepare diagnostic devices. As shown herein, even a small percentage of genes randomly selected from the necessary set for a specific classification question may be used to “revive” a depleted dataset.
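The random selection step used to “revive” a depleted dataset might be sketched as below. The gene identifiers, set sizes and the helper name `add_random_fraction` are hypothetical and chosen only for illustration. Re-deriving classifiers from the revived set and measuring their average log odds ratio would require a classifier algorithm and is not shown.

```python
import random


def add_random_fraction(necessary, depleted, fraction, seed=0):
    """Return the depleted set plus a random fraction of the necessary set."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    k = round(fraction * len(necessary))
    picked = set(rng.sample(sorted(necessary), k))
    return set(depleted) | picked, picked


# Hypothetical identifiers; sizes are illustrative assumptions.
necessary = {f"necessary_gene_{i}" for i in range(311)}
depleted = {f"depleted_gene_{i}" for i in range(1000)}

revived, picked = add_random_fraction(necessary, depleted, fraction=0.15)
print(len(picked), len(revived))
```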
In addition to providing an improved diagnostic tool, the comparative analysis of the multiple independent and/or non-overlapping signatures that exist within a “necessary” set of variables can provide insight into structural and functional features of the full dataset from which the signatures are derived. For example, a method of sequentially “stripping” away gene signatures from the full dataset may reveal underlying gene signatures associated with distinct metabolic pathways. These distinct and independent signatures can provide an alternative signature useful for development of a novel diagnostic test. Thus, the present invention provides tools to develop novel toxicology or pharmacology signatures, or diagnostic assays.
“Multivariate dataset” as used herein, refers to any dataset comprising a plurality of different variables including but not limited to chemogenomic datasets comprising logratios from differential gene expression experiments, such as those carried out on polynucleotide microarrays, or multiple protein binding affinities measured using a protein chip. Other examples of multivariate data include assemblies of data from a plurality of standard toxicological or pharmacological assays (e.g., blood analytes measured using enzymatic assays, antibody based ELISA or other detection techniques).
“Variable” as used herein, refers to any value that may vary. For example, variables may include relative or absolute amounts of biological molecules, such as mRNA or proteins, or other biological metabolites. Variables may also include dosing amounts of test compounds.
“Classifier” as used herein, refers to a function of a set of variables that is capable of answering a classification question. A “classification question” may be of any type susceptible to yielding a yes or no answer (e.g., “Is the unknown a member of the class or does it belong with everything else outside the class?”). “Linear classifiers” refers to classifiers comprising a first order function of a set of variables, for example, a summation of a weighted set of gene expression logratios. A valid classifier is defined as a classifier capable of achieving a performance for its classification task at or above a selected threshold value. For example, a log odds ratio ≧4.00 represents a preferred threshold of the present invention. Higher or lower threshold values may be selected depending on the specific classification task.
“Signature” as used herein, refers to a combination of variables, weighting factors, and other constants that provides a unique value or function capable of answering a classification question. A signature may include as few as one variable. Signatures include but are not limited to linear classifiers comprising sums of the product of gene expression logratios by weighting factors and a bias term.
“Weighting factor” (or “weight”) as used herein, refers to a value used by an algorithm in combination with a variable in order to adjust the contribution of the variable.
“Impact factor” or “Impact” as used herein in the context of classifiers or signatures refers to the product of the weighting factor by the average value of the variable of interest. For example, where gene expression logratios are the variables, the product of the gene's weighting factor and the gene's measured expression log10 ratio yields the gene's impact. The sum of the impacts of all of the variables (e.g., genes) in a set yields the “total impact” for that set.
“Scalar product” (or “Signature score”) as used herein refers to the sum of impacts for all genes in a signature less the bias for that signature. A positive scalar product for a sample indicates that it is positive for (i.e., a member of) the classification that is determined by the classifier or signature.
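The definitions of impact, total impact and scalar product can be illustrated with a short Python sketch. The gene names, weights, logratios and bias below are invented values used only to show the arithmetic.

```python
def impacts(weights, logratios):
    """Per-gene impact: weighting factor times the gene's expression logratio."""
    return {gene: w * logratios[gene] for gene, w in weights.items()}


def scalar_product(weights, logratios, bias):
    """Signature score: the sum of impacts (total impact) less the bias."""
    return sum(impacts(weights, logratios).values()) - bias


# Hypothetical three-gene signature and one sample's log10 ratios.
weights = {"geneA": 2.0, "geneB": -1.0, "geneC": 0.5}
logratios = {"geneA": 0.4, "geneB": 0.1, "geneC": -0.2}
bias = 0.3

gene_impacts = impacts(weights, logratios)
score = scalar_product(weights, logratios, bias)
print(gene_impacts, score)
print("in class" if score > 0 else "not in class")
```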
“Sufficient set” as used herein is a set of variables (e.g., genes, weights, bias factors) whose cross-validated performance for answering a specific classification question is greater than an arbitrary threshold (e.g., a log odds ratio ≧4.0).
“Necessary set” as used herein is a set of variables whose removal from the full set of all variables results in a depleted set whose performance for answering a specific classification question does not rise above an arbitrarily defined minimum level (e.g., log odds ratio ≧4.00).
“Log odds ratio” or “LOR” is used herein to summarize the performance of classifiers or signatures. LOR is defined generally as the natural log of the ratio of the odds of predicting a subject to be positive when it is positive, versus the odds of predicting a subject to be positive when it is negative. LOR is estimated herein using a set of training or test cross-validation partitions according to the following equation,
where c (typically c=40 as described herein) equals the number of partitions, and TPi, TNi, FPi, and FNi represent the number of true positive, true negative, false positive, and false negative occurrences in the test cases of the ith partition, respectively.
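The equation itself is not reproduced in this text, so the following Python sketch is only one form consistent with the definitions above. It assumes that the counts are pooled across the c partitions before the odds ratio is taken, and it assumes a small additive `offset` (0.5 by default) to avoid division by zero when a count is empty. Both choices are assumptions, as are the example counts.

```python
import math


def log_odds_ratio(partitions, offset=0.5):
    """Natural log of the pooled odds ratio over cross-validation partitions.

    Each partition is a dict with integer counts under "TP", "TN", "FP", "FN".
    The offset is an assumed correction that keeps the ratio finite.
    """
    tp = sum(p["TP"] for p in partitions)
    tn = sum(p["TN"] for p in partitions)
    fp = sum(p["FP"] for p in partitions)
    fn = sum(p["FN"] for p in partitions)
    return math.log(((tp + offset) * (tn + offset)) / ((fp + offset) * (fn + offset)))


# Hypothetical counts for two test partitions.
partitions = [
    {"TP": 9, "TN": 40, "FP": 1, "FN": 1},
    {"TP": 9, "TN": 40, "FP": 1, "FN": 1},
]
lor = log_odds_ratio(partitions)
print(lor >= 4.0)  # compare against the preferred threshold
```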
“Array” as used herein, refers to a set of different biological molecules (e.g., polynucleotides, peptides, carbohydrates, etc.). An array may be immobilized in or on one or more solid substrates (e.g., glass slides, beads, or gels) or may be a collection of different molecules in solution (e.g., a set of PCR primers). An array may include a plurality of biological polymers of a single class (e.g., polynucleotides) or a mixture of different classes of biopolymers (e.g., an array including both proteins and nucleic acids immobilized on a single substrate).
“Array data” as used herein refers to any set of constants and/or variables that may be observed, measured or otherwise derived from an experiment using an array, including but not limited to: fluorescence (or other signaling moiety) intensity ratios, binding affinities, hybridization stringency, temperature, buffer concentrations.
“Proteomic data” as used herein refers to any set of constants and/or variables that may be observed, measured or otherwise derived from an experiment involving a plurality of mRNA translation products (e.g., proteins, peptides, etc.) and/or small molecular weight metabolites or exhaled gases associated with these translation products.
Sparse linear classifiers may be used to classify large multivariate datasets from DNA microarray experiments. Sparse as used here means that the vast majority of the variables have zero weight. Sparsity ensures that the sufficient and necessary gene lists produced by the methodology described above are as short as possible. The output is a short weighted gene list (i.e., a gene signature) capable of assigning an unknown treatment to one of two classes. The sparsity and linearity of the classifiers are important features. The linearity of the classifier facilitates the interpretation of the signature—the contribution of each gene to the classifier corresponds to the product of its weight and the value (i.e., logratio) from the microarray experiment. The property of sparsity ensures that the classifier uses only a few genes, which also helps in the interpretation. More importantly, however, because of sparsity the classifier may be reduced to a practical diagnostic device comprising a relatively small set of genes.
A linear classifier generated according to this invention is “sufficient” to classify. In fact, it may be the best list derivable by the algorithm for the task. Significantly, it may be possible to define other gene lists, possibly not overlapping with the first list that can classify the same data. Those other lists likely exhibit a lower performance than the initial list but may still perform better than a given threshold of performance.
The invention provides a method to derive multiple non-overlapping gene signatures for a given question. Because these non-overlapping signatures use different genes they may be used to provide an independent confirmation of the class assignment of an individual sample. Consequently, this method is useful to confirm that an unknown is a member of a given class or to confirm that a known individual is not a member of a class.
The present invention provides a method to identify all of the genes “necessary” to create a classifier that performs above a certain minimal threshold level for a specific classification question. The method also leads to a separate set of “depleted” genes which cannot be used to create a valid linear classifier for a given question.
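The iterative stripping procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used herein: `derive_classifier` is a hypothetical stand-in for a sparse linear classifier algorithm that returns the variables it used together with its log odds ratio, and `TOY_SCORES` and `toy_derive` are invented solely so that the loop can be run.

```python
def strip_necessary_set(variables, derive_classifier, threshold=4.0):
    """Strip classifiers from a dataset until none meets the threshold.

    derive_classifier(remaining) must return (variables_used, log_odds_ratio);
    it is a hypothetical hook for a sparse linear classifier algorithm.
    Returns (necessary_set, sufficient_sets, depleted_set).
    """
    remaining = set(variables)
    sufficient_sets = []
    while remaining:
        used, lor = derive_classifier(remaining)
        used = set(used)
        if not used or lor < threshold:
            break  # the remaining (depleted) variables cannot answer the question
        sufficient_sets.append(frozenset(used))
        remaining -= used  # strip this classifier's variables and repeat
    necessary = set().union(*sufficient_sets)
    return necessary, sufficient_sets, remaining


# Toy stand-in (assumption): each gene carries a made-up information score, and a
# "classifier" uses the two best remaining genes, with LOR equal to their summed score.
TOY_SCORES = {"g1": 3.0, "g2": 2.5, "g3": 2.4, "g4": 2.0, "g5": 0.5, "g6": 0.2}


def toy_derive(remaining):
    top = sorted(remaining, key=TOY_SCORES.get, reverse=True)[:2]
    return set(top), sum(TOY_SCORES[g] for g in top)


necessary, sufficient_sets, depleted = strip_necessary_set(TOY_SCORES, toy_derive)
print(sorted(necessary))  # genes from every classifier that met the threshold
print(sorted(depleted))   # genes left over after stripping
```

In this toy run, two non-overlapping sufficient sets are stripped before the remaining genes fall below the threshold. The union of the stripped sets is the necessary set, and the leftover genes form the depleted set.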
A. Multivariate Datasets
a. Various Useful Multivariate Data Types
The present invention may be used with a wide range of multivariate data types to identify necessary and sufficient sets of variables useful for generating linear classifiers.
The smaller circles (103-106) inside the necessary set box depicted in
A preferred embodiment is the application of the present invention with data generated by high-throughput biological assays such as DNA array experiments, or proteomic assays. For example, as larger multivariate data sets are assembled for large sets of molecules (e.g., small or large chemical compounds) the present method may be applied to these datasets to allow facile generation of multiple, non-overlapping linear classifiers. The large datasets may include any sort of molecular characterization information including, e.g., spectroscopic data (e.g., UV-Vis, NMR, IR, mass spectrometry, etc.), structural data (e.g., three-dimensional coordinates) and functional data (e.g., activity assays, binding assays). The classifiers produced by using the present invention with such a dataset may be applied in a multitude of analytical contexts, including the development and manufacture of derivative detection devices (i.e., diagnostics). For example, one may use the present invention with a large multivariate dataset of human metabolite levels to generate classifiers useful in a simplified device for detecting various different ingested toxins used by emergency medical personnel.
Generally, the present invention will be useful wherever it is necessary to simplify data classification. One of ordinary skill will recognize that the methods of the present invention may be applied to multivariate data in areas outside of biotechnology, chemistry, pharmaceuticals or the life sciences. For example, the present invention may be used in physical science applications such as climate prediction, or oceanography, where it is essential to prepare simple signatures capable of being used for detection.
Large dataset classification problems are common in the finance industry (e.g., banks, insurance companies, stock brokers, etc.). A typical finance industry classification question is whether to grant a new insurance policy (or home mortgage) versus not. The variables to consider are any information available on the prospective customer or, in the case of stock, any information on the specific company or even the general state of the market. The finance industry equivalent to the “gene signatures” described in the Examples below would be financial signatures for a specific financing decision. The present invention would identify a necessary set of financial variables useful for generating financial signatures capable of answering a specific financing question.
b. Construction of a Multivariate Dataset
As discussed above, the method of the present invention may be used to identify necessary and sufficient subsets of responsive variables within any multivariate data set that are useful for answering classification questions. In preferred embodiments the multivariate dataset comprises chemogenomic data. For example, the data may correspond to treatments of organisms (e.g., cells, worms, frogs, mice, rats, primates, or humans, etc.) with chemical compounds at varying dosages and times followed by gene expression profiling of the organism's transcriptome (e.g., measuring mRNA levels) or proteome (e.g., measuring protein levels). In the case of multicellular organisms (e.g., mammals) the expression profiling may be carried out on various tissues of interest (e.g., liver, kidney, marrow, spleen, heart, brain, intestine). Typically, valid sufficient classifiers or signatures may be generated that answer questions relevant to classifying treatments in a single tissue type. The present specification describes examples of necessary and sufficient sets of genes useful for classifying chemogenomic data in liver tissue. The methods of the present invention may also be used, however, to generate signatures in any tissue type. In some embodiments, classifiers or signatures may be useful in more than one tissue type. Indeed, a large chemogenomic dataset, like that exemplified in Example 1, may reveal gene signatures in one tissue type (e.g., liver) that also classify pathologies in other tissues (e.g., intestine).
In addition to the expression profile data, the present invention may be useful with chemogenomic datasets including additional data types such as data from classic biochemistry assays carried out on the organisms and/or tissues of interest. Other data included in a large multivariate dataset may include histopathology, pharmacology assays, and structural data for the chemical compounds of interest. Such a multi-data type database permits a series of correlations to be made across data types, thereby providing insights not possible otherwise. For example, a histopathology may be correlated with an expression pattern which is then correlated with an off-target pathway of a class of compound structures. One example of a chemogenomic multivariate dataset particularly useful with the present invention is a dataset based on DNA array expression profiling data as described in U.S. patent application Ser. No. 09/977,064 filed Oct. 11, 2001 (titled “Interactive Correlation of Compound Information and Genomic Information”), which is hereby incorporated by reference for all purposes. Microarrays are well known in the art and consist of a substrate to which probes that correspond in sequence to genes or gene products (e.g., cDNAs, mRNAs, cRNAs, polypeptides, and fragments thereof), can be specifically hybridized or bound at a known position. The microarray is an array (i.e., a matrix) in which each position represents a discrete binding site for a gene or gene product (e.g., a DNA or protein), and in which binding sites are present for many or all of the genes in an organism's genome.
As disclosed above, a treatment may include but is not limited to the exposure of a biological sample or organism (e.g., a rat) to a drug candidate (or other chemical compound), the introduction of an exogenous gene into a biological sample, the deletion of a gene from the biological sample, or changes in the culture conditions of the biological sample. Responsive to a treatment, a gene corresponding to a microarray site may, to varying degrees, be (a) up-regulated, in which more mRNA corresponding to that gene may be present, (b) down-regulated, in which less mRNA corresponding to that gene may be present, or (c) unchanged. The amount of up-regulation or down-regulation for a particular matrix location is made capable of machine measurement using known methods (e.g., fluorescence intensity measurement). For example, a two-color fluorescence detection scheme is disclosed in U.S. Pat. Nos. 5,474,796 and 5,807,522, both of which are hereby incorporated by reference herein. Single color schemes are also well known in the art, wherein the amount of up- or down-regulation is determined in silico by calculating the ratio of the intensities from the test array divided by those from a control.
After treatment and appropriate processing of the microarray, the photon emissions are scanned into numerical form, and an image of the entire microarray is stored in the form of an image representation such as a color JPEG or TIFF format. The presence and degree of up-regulation or down-regulation of the gene at each microarray site represents, for the perturbation imposed on that site, the relevant output data for that experimental run or scan.
The methods for reducing datasets disclosed herein are broadly applicable to other gene and protein expression data. For example, in addition to microarray data, biological response data including gene expression level data generated from serial analysis of gene expression (SAGE, supra) (Velculescu et al., 1995, Science, 270:484) and related technologies are within the scope of the multivariate data suitable for analysis according to the method of the invention. Other methods of generating biological response signals suitable for the preferred embodiments include, but are not limited to: traditional Northern and Southern blot analysis; antibody studies; chemiluminescence studies based on reporter genes such as luciferase or green fluorescent protein; Lynx; READS (GeneLogic); and methods similar to those disclosed in U.S. Pat. No. 5,569,588 to Ashby et al., “Methods for drug screening,” the contents of which are hereby incorporated by reference into the present disclosure.
In another preferred embodiment, the large multivariate dataset may include genotyping (e.g., single-nucleotide polymorphism) data. The present invention may be used to generate necessary and sufficient sets of variables capable of classifying genotype information. These signatures would include specific high-impact SNPs that could be used in a genetic diagnostic or pharmacogenomic assay.
The method of generating classifiers from a multivariate dataset according to the present invention may be aided by the use of relational database systems (e.g., in a computing system) for storing and retrieving large amounts of data. The advent of high-speed wide area networks and the internet, together with the client/server based model of relational database management systems, is particularly well-suited for meaningfully analyzing large amounts of multivariate data given the appropriate hardware and software computing tools. Computerized analysis tools are particularly useful in experimental environments involving biological response signals. Generally, multivariate data may be obtained and/or gathered using typical biological response signals. Responses to biological or environmental stimuli may be measured and analyzed in a large-scale fashion through computer-based scanning of the machine-readable signals, e.g., photons or electrical signals, into numerical matrices, and through the storage of the numerical data into relational databases. For example a large chemogenomic dataset may be constructed as described in U.S. patent application Ser. No. 09/977,064 filed Oct. 11, 2001 (titled “Interactive Correlation of Compound Information and Genomic Information”) which is hereby incorporated by reference for all purposes.
B. Generating Valid Classifiers from a Dataset
a. Mining of a Large Multivariate Dataset for Classifiers
Generally classifiers or signatures are generated (i.e., mined) from a large multivariate dataset by first labeling the full dataset according to known classifications and then applying an algorithm to the full dataset that produces a linear classifier for each particular classification question. Each signature so generated is then cross-validated using a standard split sample procedure.
The initial questions used to classify (i.e., the classification questions) a large multivariate dataset may be of any type susceptible to yielding a yes or no answer. The general form of such questions is: “Is the unknown a member of the class or does it belong with everything else outside the class?” For example, in the area of chemogenomic datasets, classification questions may include “mode-of-action” questions such as “All treatments with drugs belonging to a particular structural class versus the rest of the treatments” or pathology questions such as “All treatments resulting in a measurable pathology versus all other treatments.” In the specific case of chemogenomic datasets based on gene expression, it is preferred that the classification questions are further categorized based on the tissue source of the gene expression data. Similarly, it may be helpful to subdivide other types of large data sets so that specific classification questions are limited to particular subsets of data (e.g., data obtained at a certain time or dose of test compound). Typically, the significance of subdividing data within large datasets becomes apparent upon initial attempts to classify the complete dataset. A principal component analysis of the complete data set may be used to identify the subdivisions in a large dataset (see e.g., US 2003/0180808 A1, published Sep. 25, 2003, which is hereby incorporated by reference herein). Methods of using classifiers to identify information rich genes in large chemogenomic datasets are also described in U.S. Ser. No. 11/114,998, filed Apr. 25, 2005, which is hereby incorporated by reference herein for all purposes.
Labels are assigned to each individual (e.g., each compound treatment) in the dataset according to a rigorous rule-based system. The +1 label indicates that a treatment falls in the class of interest, while a −1 label indicates that the treatment is outside the class. Information used in assigning labels to the various individuals to classify may include annotations from the literature related to the dataset (e.g., known information regarding the compounds used in the treatment), or experimental measurements on the exact same animals (e.g., results of clinical chemistry or histopathology assays performed on the same animal).
A more detailed description of 101 classification questions directed to liver tissue is provided in Table 2 in the Examples section below. The “Classification Name” column lists an abbreviated name or description for the particular classification. “Tissue” indicates the tissue from which the signature was derived. Generally, the gene signature works best for classifying gene expression data from tissue samples from which it was derived. In the present example, all 101 signatures generated are valid in liver tissue. The “Universe Description” is a description of the samples that will be classified by the signature. The chemogenomic dataset described in Example 1 contains information from several tissue types at multiple doses and multiple time points. In order to derive gene signatures it is often useful to restrict classification to only parts of the dataset. So for example, it often is useful to restrict classification to a single tissue. Other common restrictions are to specific time points, for example day 3 or day 5 time points. The “Universe Description” contains phrases like “Tissue=Liver and Timepoint>=3” which translates into a restriction that the signature will be derived from compound treatments measured by gene expression analysis of liver tissue on days 3, 5 or 7 (or later if available). Other phrases might say, “Not (Activity_Class_Union=***BLANK***)” which translates into a restriction that any treatment for which the compound has not been annotated with an “Activity_Class_Union” be excluded from the Universe definition. “Class+1 Description” lists descriptions of the definition of the compound treatments in the chemogenomic database that were labeled in the positive group for deriving the signature. “Class−1 description” is the description of the compound treatments that were labeled as not in the class for deriving the signature. “Class 0 description” is the description of the compound treatments that were not used to derive the gene signature.
The 0 label is used to exclude compounds for which the +1 or −1 label is ambiguous. For example, in the case of a literature pharmacology signature, there are cases where the compound is neither an agonist nor an antagonist but rather a partial agonist. In this case, the safe assumption is to derive a gene signature without including the gene expression data for this compound treatment. Then the gene signature may be used to classify the ambiguous compound after it has been derived. “LOR” refers to the average log odds ratio, which is a measure of the performance of each signature.
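The +1/−1/0 labeling scheme described above can be illustrated with a minimal Python sketch. The function names, the annotation strings, and the treatment names are hypothetical and are used only for illustration; the actual rule-based labeling system is not specified at this level of detail.

```python
def assign_label(annotation, positive_class, ambiguous=frozenset()):
    """Return +1 (in class), -1 (outside class), or 0 (ambiguous, excluded)."""
    if annotation in ambiguous:
        return 0
    return 1 if annotation == positive_class else -1


def training_subset(treatments, positive_class, ambiguous=frozenset()):
    """Keep only treatments labeled +1 or -1; 0-labeled treatments are left out
    of signature derivation and may be classified afterwards."""
    labeled = [(name, assign_label(annotation, positive_class, ambiguous))
               for name, annotation in treatments]
    return [(name, label) for name, label in labeled if label != 0]


# Hypothetical annotations for three compound treatments.
treatments = [("cmpd_A", "agonist"),
              ("cmpd_B", "antagonist"),
              ("cmpd_C", "partial agonist")]
subset = training_subset(treatments, "agonist", ambiguous={"partial agonist"})
```

In this sketch the partial agonist is dropped from the training subset, mirroring the use of the 0 label.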
As listed in Table 2, there are several different types of class descriptions used to characterize the classification questions. “Structure Activity Class” (SAC) is a description of both the chemical structure and the pharmacological activity of the compound. Thus, for example, estrogen receptor agonists form one group. Another example: bacterial DNA gyrase inhibitor, 8-fluoro-fluoroquinolone and 8-alkoxy-fluoroquinolone antibiotics each form separate SAC classes even though both share the same pharmacological target, DNA gyrase. “Activity_Class_Union” (also referred to as “Union Class”) is a higher level description of several SAC classes. For example, the DNA gyrase Union Class would include both 8-fluoro-fluoroquinolone and 8-alkoxy-fluoroquinolone antibiotics.
Compound activities are also referred to in the class descriptions listed in Table 2. The exact assay referred to in each activity measurement is encoded as “IC50-XXXXX|Assay name,” where XXXXX is the catalog number for the assay in the MDS-Pharma Services on-line catalog found at URL “discovery.mdsps.com/catalog”. Thus, for example, “IC50-21950|Dopamine D1” indicates the Dopamine D1 assay with the MDS catalog number 21950. All compound activities are reported as −log(IC50), where the IC50 is reported in μM. Therefore, “>=0.000000000001” indicates that the value should be greater than zero, i.e., that the IC50 is less than 1 μM (since log(1 μM)=0). Furthermore, the testing protocols used in constructing the database of Example 1 did not determine IC50 values greater than about 35 μM. All cases where the IC50 was estimated to be greater than 35 μM were recorded in the database as “−3” (i.e., the IC50 was considered to be 1 mM and thus −log(1000 μM)=−3). This number implies that the compound does not bind to the site under investigation.
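The activity encoding can be sketched in a few lines of Python. This is a minimal illustration assuming a base-10 logarithm (consistent with −log(1000 μM)=−3); the function and constant names are hypothetical.

```python
import math

ASSAY_LIMIT_UM = 35.0        # IC50 values above about 35 uM were not determined
NON_BINDER_VALUE = -3.0      # stored value, equal to -log10(1000 uM)


def encode_activity(ic50_um):
    """Return -log10(IC50 in uM), or -3 when the IC50 exceeds the assay limit."""
    if ic50_um > ASSAY_LIMIT_UM:
        return NON_BINDER_VALUE
    return -math.log10(ic50_um)


def meets_cutoff(activity, cutoff=0.000000000001):
    """True when the encoded activity satisfies a '>=cutoff' class description,
    i.e., when the IC50 is below 1 uM for the default cutoff."""
    return activity >= cutoff
```

For example, an IC50 of 0.1 μM encodes to 1 and passes the default cutoff, while an IC50 of exactly 1 μM encodes to 0 and does not.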
b. Algorithms for Generating Valid Classifiers
Dataset classification may be carried out manually, that is by evaluating the dataset by eye and classifying the data accordingly. However, because the dataset may involve tens of thousands (or more) individual variables, more typically, querying the full dataset with a classification question is carried out in a computer employing any of the well-known data classification algorithms.
In preferred embodiments, algorithms that generate linear classifiers are used to query the full dataset. In particularly preferred embodiments the algorithm is selected from the group consisting of: SPLP, SPLR and SPMPM. These algorithms are based respectively on Support Vector Machines (SVM), Logistic Regression (LR) and Minimax Probability Machine (MPM). They have been described in detail elsewhere (See e.g., El Ghaoui et al., op. cit.; Brown, M. P., W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares, Jr., and D. Haussler, “Knowledge-based analysis of microarray gene expression data by using support vector machines,” Proc Natl Acad Sci U.S.A. 97: 262-267 (2000)).
Generally, the sparse classification methods SPLP, SPLR, SPMPM are linear classification algorithms in that they determine the optimal hyperplane separating a positive and a negative class. This hyperplane, H, can be characterized by a vectorial parameter, w (the weight vector) and a scalar parameter, b (the bias): $H=\{x \mid w^{T}x+b=0\}$.
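Applying such a hyperplane to a new sample amounts to computing w·x + b and checking its sign. The following Python sketch assumes a sparse signature stored as a mapping from gene identifier to weight; the gene names and values are hypothetical.

```python
def decision_value(weights, bias, expression):
    """Compute w.x + b using only the genes present in the sparse signature.
    Genes missing from the expression profile contribute zero."""
    return sum(w * expression.get(gene, 0.0) for gene, w in weights.items()) + bias


def classify(weights, bias, expression):
    """Return +1 if the sample lies on the positive side of the hyperplane, else -1."""
    return 1 if decision_value(weights, bias, expression) > 0 else -1


# Hypothetical two-gene signature and expression profile (log ratios).
signature = {"geneA": 2.0, "geneB": -1.0}
sample = {"geneA": 1.0, "geneB": 0.5, "geneC": 9.0}
label = classify(signature, -0.5, sample)
```

Because the weight vector is sparse, only the genes in the signature need to be measured to classify a sample.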
For all proposed algorithms, determining the optimal hyperplane reduces to optimizing the error on the provided training data points, computed according to some loss function (e.g., the “Hinge loss,” i.e., the loss function used in 1-norm SVMs; the “LR loss;” or the “MPM loss”), augmented with a 1-norm regularization on the signature, w. Regularization helps to provide a sparse, short signature. Moreover, this 1-norm penalty on the signature will be weighted by the average standard error per gene. That is, genes that have been measured with more uncertainty will be less likely to get a high weight in the signature. Consequently, the proposed algorithms lead to sparse signatures, and take into account the average standard error information.
Mathematically, the algorithms can be described by the cost functions (shown below for SPLP, SPLR and SPMPM) that they actually minimize to determine the parameters w and b.
In the SPLP cost function, the first term minimizes the training set error, while the second term is the 1-norm penalty on the signature w, weighted by the average standard error information per gene given by sigma. The training set error is computed according to the so-called Hinge loss, as defined in the constraints. This loss function penalizes every data point that is closer than “1” to the separating hyperplane H, or is on the wrong side of H. Notice how the hyperparameter rho allows a trade-off between training set error and sparsity of the signature w.
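Written out under assumed notation, a cost function consistent with this description of SPLP is sketched below. The slack variables $e_i$, class labels $y_i \in \{+1,-1\}$, training vectors $x_i$, and the indices $i$ (samples) and $j$ (genes) are notation introduced here for illustration; $w$, $b$, $\rho$ and $\sigma$ are as described in the text.

```latex
\min_{w,\,b,\,e}\;\; \sum_{i=1}^{N} e_i \;+\; \rho \sum_{j} \sigma_j \,\lvert w_j \rvert
\qquad \text{subject to} \qquad
y_i\,(w^{T} x_i + b) \;\ge\; 1 - e_i, \quad e_i \ge 0, \quad i = 1,\dots,N
```

Both the objective and the constraints are linear in the unknowns once each $\lvert w_j \rvert$ is replaced by an auxiliary variable, which is the usual way such a problem is posed as a linear program.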
In the SPLR cost function, the first term expresses the negative log likelihood of the data (a smaller value indicating a better fit of the data), as usual in logistic regression, and the second term will give rise to a short signature, with rho determining the trade-off between both.
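A cost function consistent with this description of SPLR, using the standard logistic-regression negative log likelihood, is sketched below. The labels $y_i \in \{+1,-1\}$, training vectors $x_i$, and indices are assumed notation; the penalty term has the same form as the 1-norm penalty described for the other algorithms.

```latex
\min_{w,\,b}\;\; \sum_{i=1}^{N} \log\!\bigl(1 + \exp\bigl(-y_i\,(w^{T} x_i + b)\bigr)\bigr)
\;+\; \rho \sum_{j} \sigma_j \,\lvert w_j \rvert
```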
In the SPMPM cost function, the first two terms, together with the constraint, are related to the misclassification error, while the third term will induce sparsity, as before. The symbols with a hat are empirical estimates of the covariances and means of the positive and the negative class. Given those estimates, the misclassification error is controlled by determining w and b such that even for the worst-case distributions for the positive and negative class (which we do not exactly know here) with those means and covariances, the classifier will still perform well. More details on how this exactly relates to the previous cost function can be found in e.g., El Ghaoui et al., op. cit.
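A cost function consistent with this description of SPMPM, following the usual minimax probability machine formulation, is sketched below. The symbols $\hat{x}_{+}$, $\hat{x}_{-}$ (estimated class means) and $\hat{\Gamma}_{+}$, $\hat{\Gamma}_{-}$ (estimated class covariances) are assumed notation for the hatted quantities mentioned in the text.

```latex
\min_{w}\;\; \sqrt{w^{T}\hat{\Gamma}_{+}\,w} \;+\; \sqrt{w^{T}\hat{\Gamma}_{-}\,w}
\;+\; \rho \sum_{j} \sigma_j \,\lvert w_j \rvert
\qquad \text{subject to} \qquad
w^{T}\bigl(\hat{x}_{+} - \hat{x}_{-}\bigr) = 1
```

In the standard minimax probability machine, the bias $b$ is then obtained from the optimal $w$ and the class means and covariances rather than being optimized directly; whether SPMPM does the same is an assumption here.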
As mentioned above, classification algorithms capable of producing linear classifiers are preferred for use with the present invention. In the context of chemogenomic datasets, linear classifiers may be used to generate one or more valid signatures, each comprising a series of genes and associated weighting factors, capable of answering a classification question. Linear classification algorithms are particularly useful with DNA array or proteomic datasets because they provide simplified signatures useful for answering a wide variety of questions related to biological function and pharmacological/toxicological effects associated with genes or proteins. These signatures are particularly useful because they are easily incorporated into a wide variety of DNA- or protein-based diagnostic assays (e.g., DNA microarrays).
However, some classes of non-linear classifiers, so called kernel methods, may also be used to develop short gene lists, weights and algorithms that may be used in diagnostic device development; while the preferred embodiment described here uses linear classification methods, it specifically contemplates that non-linear methods may also be suitable.
Classifications may also be carried out using principal component analysis and/or discrimination metric algorithms well-known in the art (see e.g., US 2003/0180808 A1, published Sep. 25, 2003, which is hereby incorporated by reference herein).
c. Cross-Validation of Classifiers
Cross-validation of signature performance is an important step for identifying sufficient signatures. Cross-validation may be carried out by first randomly splitting the full dataset (e.g., a 60/40 split). A training signature is derived from the training set composed of 60% of the samples and used to classify both the training set and the remaining 40% of the data, referred to herein as the test set. In addition, a complete signature is derived using all the data. The performance of these signatures can be measured in terms of log odds ratio (LOR) or the error rate (ER) defined as:
LOR=ln(((TP+0.5)*(TN+0.5))/((FP+0.5)*(FN+0.5)))
and
ER=(FP+FN)/N;
where TP, TN, FP, FN, and N are true positives, true negatives, false positives, false negatives, and total number of samples to classify, respectively, summed across all the cross validation trials. The performance measures are used to characterize the complete signature, the average of the training or the average of the test signatures.
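The two performance measures defined above translate directly into code. The following Python sketch takes confusion-matrix counts already summed across the cross-validation trials; the function names are illustrative only.

```python
import math


def log_odds_ratio(tp, tn, fp, fn):
    """LOR = ln(((TP+0.5)*(TN+0.5)) / ((FP+0.5)*(FN+0.5))).
    The 0.5 terms keep the ratio finite when a count is zero."""
    return math.log(((tp + 0.5) * (tn + 0.5)) / ((fp + 0.5) * (fn + 0.5)))


def error_rate(tp, tn, fp, fn):
    """ER = (FP + FN) / N, where N is the total number of classified samples."""
    n = tp + tn + fp + fn
    return (fp + fn) / n


def is_valid(tp, tn, fp, fn, threshold=4.0):
    """True when the classifier meets the chosen LOR threshold."""
    return log_odds_ratio(tp, tn, fp, fn) >= threshold
```

With equal counts in all four cells the LOR is 0, the value expected by chance; a classifier with 10 true positives, 10 true negatives and no errors gives ln(441), roughly 6.09.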
The algorithms described above generate a plurality of classifiers with varying degrees of performance for the classification task. In order to identify valid classifiers, a threshold performance is set for an answer to the particular classification question. In one preferred embodiment, the classifier threshold performance is set as log odds ratio greater than or equal to 4.00 (i.e., LOR≥4.00). However, higher or lower thresholds may be used depending on the particular dataset and the desired properties of the classifiers so obtained. Of course many queries of the dataset with a classification question will not generate a valid classifier.
Two or more valid signatures may be generated that are redundant or synonymous for a variety of reasons. Different classification questions (i.e., class definitions) may result in identical classes and therefore identical signatures. For instance, the following two class definitions define the exact same treatments in the database: (1) all treatments with molecules structurally related to statins; and (2) all treatments with molecules having an IC50<1 μM for inhibition of the enzyme HMG CoA reductase.
In addition, when a large dataset is queried with the same classification question using different algorithms (or even the same algorithm under slightly different conditions) different, valid signatures may be obtained. These different signatures may or may not comprise overlapping sets of variables; however, they each can accurately identify members of the class of interest.
For example, as illustrated in Table 1, two equally performing gene signatures (LOR≈7.0) for the fibrate class of compounds may be generated by querying a chemogenomic dataset with two different algorithms: SPLP and SPLR. Genes are designated by their accession number and a brief description. The weights associated with each gene are also indicated. Each signature was trained on the exact same 60% of the multivariate dataset and then cross validated on the exact same remaining 40% of the dataset. Both signatures were shown to exhibit the exact same level of performance as classifiers: two errors on the cross validation data set. The SPLP derived signature consists of 20 genes. The SPLR derived signature consists of eight genes. Only three of the genes from the SPLP signature are present in the eight gene SPLR signature.
It is interesting to note that only three genes (K03249, BF282712, and BF387347) are common to these two signatures, and even those are associated with different weights. While many of the genes may be different, some commonalities may nevertheless be discerned. For example, one of the negatively weighted genes in the SPLP derived signature is NM_017136 encoding squalene epoxidase, a well-known cholesterol biosynthesis gene. Squalene epoxidase is not present in the SPLR derived signature, but acetoacetyl-CoA synthetase, another cholesterol biosynthesis gene, is present and is also negatively weighted.
Additional variant signatures may be produced for the same classification task. For example, the average signature length (number of genes) produced by SPLP and SPLR, as well as the other algorithms, may be varied by use of the parameter ρ (see e.g., El Ghaoui, L., G. R. G. Lanckriet, and G. Natsoulis, 2003, “Robust classifiers with interval data” Report# UCB/CSD-03-1279. Computer Science Division (EECS), University of California, Berkeley, Calif.; and U.S. provisional applications U.S. Ser. No. 60/495,975, filed Aug. 13, 2003 and U.S. Ser. No. 60/495,081, filed Aug. 13, 2003, each of which is hereby incorporated by reference herein). Varying ρ can produce signatures of different length with comparable test performance (Natsoulis et al., 2004, Gen. Res.). Those signatures are obviously different and often have no common genes between them (i.e., they do not overlap in terms of genes used).
C. Stripping Valid Classifiers to Generate the “Necessary” Variables
Each individual classifier or signature is capable of classifying a dataset into one of two categories or classes defined by the classification question. Typically, an individual signature with the highest test log odds ratio will be considered as the best classifier for a given task. However, often the second, third (or lower) ranking signatures, in terms of performance, may be useful for confirming the classification of a compound treatment, especially where the unknown compound yields a borderline answer based on the best classifier. Furthermore, the additional signatures may identify alternative sources of information-rich data associated with the specific classification question. For example, a slightly lower ranking gene signature from a chemogenomic dataset may include those genes associated with a secondary metabolic pathway affected by the compound treatment. Consequently, for purposes of fully characterizing a class and answering difficult classification questions, it is useful to define the entire set of variables that may be used to produce the plurality of different classifiers capable of answering a given classification question. This set of variables is referred to herein as a “necessary set.” Conversely, the remaining variables from the full dataset are those that collectively cannot be used to produce a valid classifier, and therefore are referred to herein as the “depleted set.”
The general method for identifying a necessary set of variables useful for a classification question involves what is referred to herein as a classifier “stripping” algorithm. The stripping algorithm comprises the following steps: (1) querying the full dataset with a classification question so as to generate a first linear classifier, comprising a first set of variables, capable of performing with a log odds ratio greater than or equal to 4.0; (2) removing the variables of the first linear classifier from the full dataset thereby generating a partially depleted dataset; (3) re-querying the partially depleted dataset with the same classification question so as to generate a second linear classifier and cross-validating this second classifier to determine whether it performs with a log odds ratio greater than or equal to 4.0. If it does not, the process stops and the dataset is fully depleted for variables capable of generating a classifier with an average log odds ratio greater than or equal to 4.0. If the second classifier is validated as performing with a log odds ratio greater than or equal to 4.0, then its variables are stripped from the full dataset and the partially depleted set is re-queried with the classification question. These cycles of stripping and re-querying are repeated until the performance of any remaining set of variables drops below an arbitrarily set LOR. The threshold at which the iterative process is stopped may be arbitrarily adjusted by the user depending on the desired outcome. For example, a user may choose a threshold of LOR=0. This is the value expected by chance alone. Consequently, after repeated stripping until LOR=0 there is no classification information remaining in the depleted set. Of course, selecting a lower value for the threshold will result in a larger necessary set.
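The stripping cycle described above can be sketched as a short loop. This Python illustration is not the implementation used in the Examples: `train_and_score` is a hypothetical callable standing in for any sparse classification algorithm (e.g., SPLP) plus cross-validation, assumed to return the variables used by the classifier together with its cross-validated LOR.

```python
def strip_necessary_set(variables, train_and_score, threshold=4.0, max_cycles=1000):
    """Iteratively strip valid classifiers from a dataset.

    variables       -- iterable of variable (e.g., gene) identifiers
    train_and_score -- hypothetical callable: set of available variables ->
                       (variables used by the classifier, cross-validated LOR)
    Returns (necessary set, list of stripped signatures, depleted set).
    """
    remaining = set(variables)
    necessary = set()
    signatures = []
    for _ in range(max_cycles):
        if not remaining:
            break
        signature, lor = train_and_score(remaining)
        # Stop when the best classifier from the remaining variables is not valid.
        if lor < threshold or not signature:
            break
        signatures.append((set(signature), lor))
        necessary |= set(signature)   # stripped variables join the necessary set
        remaining -= set(signature)   # and are removed before re-querying
    return necessary, signatures, remaining
```

Each stripped signature is by construction non-overlapping with the earlier ones, and lowering `threshold` lets more cycles succeed, which enlarges the necessary set.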
Although a preferred cut-off for stripping classifiers is LOR=4.0, this threshold is arbitrary. Other embodiments within the scope of the invention may utilize higher or lower stripping cutoffs e.g., depending on the size or type of dataset, or the classification question being asked. In addition other metrics could be used to assess the performance (e.g., specificity, sensitivity, and others). Also the stripping algorithm removes all variables from a signature if it meets the cutoff. Other procedures may be used within the scope of the invention wherein only the highest weighted or ranking variables are stripped. Such an approach based on variable impact would likely result in a classifier “surviving” more cycles and defining a smaller necessary set.
The resulting fully-depleted set of variables that remains after a classifier is fully stripped from the full dataset cannot generate a classifier for the specific classification question (with the desired level of performance). Consequently, the set of all of the variables in the classifiers that were stripped from the full set are defined as “necessary” for generating a valid classifier.
The stripping method utilizes a classification algorithm at its core. The examples presented here use SPLP for this task. Other algorithms, provided that they are sparse with respect to genes, could be employed. SPLR and SPMPM are two alternatives for this functionality (see e.g., El Ghaoui, L., G. R. G. Lanckriet, and G. Natsoulis, 2003, “Robust classifiers with interval data” Report# UCB/CSD-03-1279. Computer Science Division (EECS), University of California, Berkeley, Calif.; and U.S. provisional applications U.S. Ser. No. 60/495,975, filed Aug. 13, 2003 and U.S. Ser. No. 60/495,081, filed Aug. 13, 2003, each of which is hereby incorporated by reference herein).
In one embodiment, the stripping algorithm may be used on a chemogenomics dataset comprising DNA microarray data. The resulting necessary set of genes comprises a subset of highly informative genes for a particular classification question. Consequently, these genes may be incorporated in diagnostic devices (e.g., polynucleotide arrays) where that particular classification is of interest. In other exemplary embodiments, the stripping method may be used with datasets from proteomic experiments.
Besides identifying the “necessary” set of variables for a classifier, another important use of the stripping algorithm is the identification of multiple, non-overlapping sufficient sets of variables useful as classifiers for a particular question. These non-overlapping sufficient sets are a direct product of the above-described general method of stripping valid classifiers. Where the application of the method results in a second validated classifier with the desired level of performance, that second classifier by definition does not include any variables in common with the first classifier. Typically, the earlier stripped non-overlapping classifiers yield higher performance with fewer variables. In other words, the earliest identified sufficient set usually comprises the highest impact, most information-rich variables with respect to the particular classification question. The valid classifiers that appear later during the application of the stripping algorithm typically contain a larger number of variables. However, these later appearing classifiers may provide valuable information regarding normally unrecognized relationships between variables in the dataset. For example, in the case of non-overlapping gene signatures identified by stripping in a chemogenomics dataset, the later appearing signatures may include families of genes not previously recognized as involved in the particular metabolic pathway that is being affected by a particular compound treatment. Thus, functional analysis of a gene signature stripping procedure may identify new metabolic targets associated with a compound treatment.
D. Functional Characterization of Necessary Sets
The stripping method described herein produces a set of variables (e.g., genes) representing the information rich necessary set for a given classification question. Such a necessary set may also be characterized in functional terms based on the ability of the information rich genes in the set to supplement (i.e., “revive”) the ability of a fully depleted set to generate valid signatures for the classification question.
Thus, the necessary set for any classification question corresponds to that set of genes from which any random selection when added to a depleted set (i.e., depleted for that classification question) restores the ability of that set to produce signatures with an avg. LOR above a threshold level.
Preferably, the threshold performance is an avg. LOR greater than or equal to 4.00. Other values for performance, however, may be set. For example, avg. LOR may vary from about 1.0 to as high as 8.0. In preferred embodiments, the avg. LOR threshold may be 3.0 to as high as 7.0 including all integer and half-integer values in that range.
The necessary set may then be defined in terms of percentage of randomly selected genes from the necessary set that restore the performance of a depleted set above a certain threshold. Typically, the avg. LOR of the depleted set is ˜1.20, although as mentioned above, datasets may be depleted more or less depending on the threshold set, and depleted sets with avg. LOR as low as 0.0 may be used. Generally, the depleted set will exhibit an avg. LOR between about 0.5 and 1.5.
The third parameter establishing the functional characteristics of a specific necessary set of genes for answering a chemogenomic classification question is the percentage of randomly selected genes that results in restoring the threshold performance of the depleted set. Where the threshold avg. LOR is at least 4.00 and the depleted set performs with an avg. LOR of ˜1.20, typically 16-36% of randomly selected genes from the necessary set are required to restore the average performance of the depleted set to the threshold value. In preferred embodiments, the random supplementation may be achieved using 16, 18, 20, 22, 24, 26, 28, 30, 32, 34 or 36% of the necessary set.
E. Diagnostic Assays and Reagent Sets Using Necessary and Sufficient Sets of Variables
As described above, a large dataset may be mined for a plurality of informative variables useful for answering classification questions. The size of the classifiers or signatures so generated may be varied according to experimental needs. In addition, multiple non-overlapping classifiers may be generated where independent experimental measures are required to confirm a classification. Generally, the necessary and sufficient sets of variables constitute a substantial reduction of data (i.e., relative to that present in the full data set), that needs to be measured to classify a sample. Consequently, the methods of the present invention provide the ability to produce cheaper, higher throughput, diagnostic measurement methods or strategies. In particular, the invention provides diagnostic reagent sets useful in diagnostic assays and the associated diagnostic devices and kits.
Diagnostic reagent sets may include reagents representing a select subset of sufficient variables consisting of less than 50%, 40%, 30%, 20%, 10%, or even less than 5% of the total analytical probes (i.e., detector moieties) present in a larger assay while still achieving the same level of performance in sample classification tasks. In one preferred embodiment, the diagnostic reagent set is a plurality of polynucleotides or polypeptides representing specific genes in a sufficient or necessary set of the invention. Such biopolymer reagent sets are immediately applicable in any of the diagnostic assay methods (and the associated kits) well known for polynucleotides and polypeptides (e.g., DNA arrays, RT-PCR, immunoassays or other receptor based assays for polypeptides or proteins). For example, by selecting only those genes found in a smaller yet “sufficient” gene signature, a faster, simpler and cheaper DNA array may be fabricated for that signature's specific classification task. Thus, a very simple diagnostic array may be designed that answers 3 or 4 specific classification questions and includes only 60-80 polynucleotides representing the approximately 20 genes in each of the signatures. Of course, depending on the level of accuracy required, the LOR threshold for selecting a sufficient gene signature may be varied. A DNA array may be designed with many more genes per signature if the LOR threshold is set at, e.g., 7.00 for a given classification question. The scope of the present invention includes diagnostic devices based on classifiers exhibiting levels of performance varying from less than LOR=3.00 up to LOR=10.00 and greater.
The diagnostic reagent sets of the invention may be provided in kits, wherein the kits may or may not comprise additional reagents or components necessary for the particular diagnostic application in which the reagent set is to be employed. Thus, for polynucleotide array applications, the diagnostic reagent sets may be provided in a kit which further comprises one or more of the additional requisite reagents for amplifying and/or labeling a microarray probe or target (e.g., polymerases, labeled nucleotides, and the like).
A variety of array formats (for either polynucleotides and/or polypeptides) are well-known in the art and may be used with the methods and subsets produced by the present invention. In one preferred embodiment, photolithographic or micromirror methods may be used to spatially direct light-induced chemical modifications of spacer units or functional groups resulting in attachment at specific localized regions on the surface of the substrate. Light-directed methods of controlling reactivity and immobilizing chemical compounds on solid substrates are well-known in the art and described in U.S. Pat. Nos. 4,562,157, 5,143,854, 5,556,961, 5,968,740, and 6,153,744, and PCT publication WO 99/42813, each of which is hereby incorporated by reference herein.
Alternatively, a plurality of molecules may be attached to a single substrate by precise deposition of chemical reagents. For example, methods for achieving high spatial resolution in depositing small volumes of a liquid reagent on a solid substrate are disclosed in U.S. Pat. Nos. 5,474,796 and 5,807,522, both of which are hereby incorporated by reference herein.
It should also be noted that in many cases a single diagnostic device may not satisfy all needs. However, even for an initial exploratory investigation (e.g., classifying drug-treated rats) DNA arrays with sufficient gene sets of varying size (number of genes), each adapted to a specific follow-up technology, can be created. In addition, in the case of drug-treated rats, different arrays may be defined for each tissue.
Alternatively, a single substrate may be produced with several different small arrays of genes in different areas on the surface of the substrate. Each of these different arrays may represent a sufficient set of genes for the same classification question but with a different optimal gene signature for each different tissue. Thus, a single array could be used for a particular diagnostic question regardless of the tissue source of the sample (or even if the sample was from a mixture of tissue sources, e.g., in a forensic sample).
In addition, it may be desirable to investigate classification questions of a different nature in the same tissue using several arrays featuring different non-overlapping gene signatures for a particular classification question.
As described above, the methodology described here is not limited to chemogenomic datasets and DNA microarray data. The invention may be applied to other types of datasets to produce necessary and sufficient sets of variables useful for generating classifiers. For example, proteomics assay techniques, where protein levels are measured, or protein interaction techniques such as yeast 2-hybrid or mass spectrometry, also result in large, highly multivariate datasets, which could be classified in the same way described here. The result of all the classification tasks could be submitted to the same methods of signature generation and/or classifier stripping in order to define specific sets of proteins useful as signatures for specific classification questions.
In addition, the invention is useful for many traditional lower throughput diagnostic applications. Indeed the invention teaches methods for generating valid, high-performance classifiers consisting of 5% or less of the total variables in a dataset. This data reduction is critical to providing a useful analytical device. For example, a large chemogenomic dataset may be reduced to a signature comprising less than 5% of the genes in the full dataset. Further reductions of these genes may be made by identifying only those genes whose product is a secreted protein. These secreted proteins may be identified based on known annotation information regarding the genes in the subset. Because the secreted proteins are identified in the sufficient set useful as a signature for a particular classification question, they are most useful in protein based diagnostic assays related to that classification. For example, an antibody-based blood serum assay may be produced using the subset of the secreted proteins found in the sufficient signature set. Hence, the present invention may be used to generate improved protein-based diagnostic assays from DNA array information.
The general method of the invention as described above is exemplified below. The following examples are offered by way of illustration and not by way of limitation. The disclosure of all citations in the specification is expressly incorporated herein by reference.
This example illustrates the construction of a large multivariate chemogenomic dataset based on DNA microarray analysis of rat tissues from over 580 different in vivo compound treatments (311 of which were tested in liver). This dataset was used to generate signatures comprising genes and weights, which subsequently were reduced to yield subsets of highly responsive genes that may be incorporated into high throughput diagnostic devices as described in Examples 2-5.
The detailed description of the construction of this chemogenomic dataset is provided in Examples 1 and 2 of Published U.S. patent application No. 2005/0060102 A1, published Mar. 17, 2005, which is hereby incorporated by reference for all purposes. Briefly, in vivo short-term repeat dose rat studies were conducted on over 580 test compounds, including marketed and withdrawn drugs, environmental and industrial toxicants, and standard biochemical reagents. Rats (three per group) were dosed daily at either a low or high dose. The low dose was an efficacious dose estimated from the literature and the high dose was an empirically-determined maximum tolerated dose, defined as the dose that causes a 50% decrease in body weight gain relative to controls during the course of the 5-day range-finding study. Animals were necropsied on days 0.25, 1, 3, and 5 or 7. Up to 13 tissues (e.g., liver, kidney, heart, bone marrow, blood, spleen, brain, intestine, glandular and nonglandular stomach, lung, muscle, and gonads) were collected for histopathological evaluation and microarray expression profiling on the Amersham CodeLink™ RU1 platform. In addition, a clinical pathology panel consisting of 37 clinical chemistry and hematology parameters was generated from blood samples collected on days 3 and 5.
In order to assure that all of the dataset is of high quality, a number of quality metrics and tests are employed. Failure on any test results in rejection of the array and exclusion from the data set. The first tests measure global array parameters: (1) average normalized signal to background, (2) median signal to threshold, (3) fraction of elements with below background signals, and (4) number of empty spots. The second battery of tests examines the array visually for unevenness and agreement of the signals to a tissue specific reference standard formed from a number of historical untreated animal control arrays (correlation coefficient >0.8). Arrays that pass all of these checks are further assessed using principal component analysis versus a dataset containing seven different tissue types; arrays not closely clustering with their appropriate tissue cloud are discarded.
Data collected from the scanner is processed by the Dewarping/Detrending™ normalization technique, which uses a non-linear centralization normalization procedure (see, Zien, A., T. Aigner, R. Zimmer, and T. Lengauer. 2001. Centralization: A new method for the normalization of gene expression data. Bioinformatics) adapted specifically for the CodeLink microarray platform. The procedure utilizes detrending and dewarping algorithms to adjust for non-biological trends and non-linear patterns in signal response, leading to significant improvements in array data quality.
Log10-ratios are computed for each gene as the difference of the averaged logs of the experimental signals from (usually) three drug-treated animals and the averaged logs of the control signals from (usually) 20 mock vehicle-treated animals. To assign a significance level to each gene expression change, the standard error for the measured change between the experiments and controls is computed. An empirical Bayesian estimate of standard deviation for each measurement is used in calculating the standard error, which is a weighted average of the measurement standard deviation for each experimental condition and a global estimate of measurement standard deviation for each gene determined over thousands of arrays (Carlin, B. P. and T. A. Louis. 2000. “Bayes and empirical Bayes methods for data analysis,” Chapman & Hall/CRC, Boca Raton; Gelman, A. 1995. “Bayesian data analysis,” Chapman & Hall/CRC, Boca Raton). The standard error is used in a t-test to compute a p-value for the significance of each gene expression change. The coefficient of variation (CV) is defined as the ratio of the standard error to the average Log10-ratio, as defined above.
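The per-gene computation described above can be sketched as follows. This is a minimal illustration, not the normalization pipeline itself: the mixing weight `prior_weight`, the function names, and the two-sample form of the standard error are assumptions introduced here, since the text does not give the exact weighting formula. The p-value step is omitted because it requires a t-distribution, which is not in the Python standard library.

```python
import math
from statistics import mean, stdev

def shrunken_sd(sample_sd, global_sd, prior_weight=0.5):
    # Weighted average of the per-condition SD and the global per-gene SD.
    # prior_weight is a hypothetical mixing weight, not taken from the text.
    return (1.0 - prior_weight) * sample_sd + prior_weight * global_sd

def summarize_gene(treated, controls, global_sd, prior_weight=0.5):
    """Return log10 ratio, standard error, t statistic and CV for one gene."""
    log_t = [math.log10(s) for s in treated]
    log_c = [math.log10(s) for s in controls]
    # Difference of the averaged logs (treated minus control).
    log_ratio = mean(log_t) - mean(log_c)
    sd_t = shrunken_sd(stdev(log_t), global_sd, prior_weight)
    sd_c = shrunken_sd(stdev(log_c), global_sd, prior_weight)
    # Assumed two-sample standard error of the difference of means.
    se = math.sqrt(sd_t ** 2 / len(log_t) + sd_c ** 2 / len(log_c))
    t_stat = log_ratio / se if se > 0 else math.inf
    # CV as defined in the text: standard error over the average log10 ratio.
    cv = se / log_ratio if log_ratio != 0 else math.inf
    return {"log_ratio": log_ratio, "se": se, "t": t_stat, "cv": cv}
```

With three treated animals and 20 controls, the control term contributes little to the standard error, so the shrinkage toward the global estimate matters most for the small treated group.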
This example illustrates the use of the “stripping” method to define the necessary and depleted sets of genes for a chemogenomic classification question.
Stripping algorithm
For each of the 101 classification questions defined by Table 2, the full chemogenomic dataset made according to Example 1 was labeled (i.e., +1, −1, or 0). The labeled dataset was then queried using the SPLP algorithm until it produced a valid signature, defined as performing with a test LOR≧4.0. Then all of the genes from the first valid signature were eliminated (i.e., “stripped”) from the full dataset. This now partially depleted dataset was then queried with the SPLP algorithm again until a second cross-validated signature was computed. If this second signature was valid, i.e., performed with a test LOR≧4.0, all of its genes were also stripped from the dataset. This process was repeated until the algorithm failed to produce a valid signature. The union set of all the “stripped” genes used in the valid signatures constituted the “necessary set.”
The genes remaining in the dataset at the end of this stripping procedure were “depleted” for the specific classification question and could be revived only by adding back some percentage of the stripped genes (see, e.g., Example 3 below). Note that this depletion is full with respect to the selected threshold of LOR=4.0. However, this set could be depleted further if additional stripping were performed with a second lower threshold, e.g., LOR=0.
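The stripping loop can be sketched as follows. This is a hedged illustration only: `find_signature` is a hypothetical stand-in for the SPLP query plus cross-validation (it takes the genes still available and returns a candidate signature with its test LOR), and `toy_finder` is an invented scoring rule used purely to make the sketch runnable.

```python
def strip_signatures(genes, find_signature, threshold=4.0, max_cycles=1000):
    """Repeatedly derive a signature and strip its genes until none is valid.

    Returns (valid signatures in order, necessary set, depleted set).
    """
    remaining = set(genes)
    necessary = set()
    signatures = []
    for _ in range(max_cycles):
        signature, lor = find_signature(remaining)
        if not signature or lor < threshold:
            break  # failure: the remaining genes are depleted at this threshold
        signature = set(signature)
        signatures.append((signature, lor))
        necessary |= signature   # necessary set = union of all valid signatures
        remaining -= signature   # strip the genes before the next cycle
    return signatures, necessary, remaining

def toy_finder(scores, size=2):
    """Hypothetical classifier: picks the top-scoring genes, LOR = score sum."""
    def find(available):
        best = sorted(available, key=lambda g: (-scores[g], g))[:size]
        return best, sum(scores[g] for g in best)
    return find

scores = {"g1": 3.0, "g2": 2.5, "g3": 2.0, "g4": 2.0, "g5": 0.5, "g6": 0.2}
signatures, necessary, depleted = strip_signatures(scores, toy_finder(scores))
```

Because each signature is removed before the next query, successive signatures cannot share a gene, which is what makes them non-overlapping by construction.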
Table 3 lists 62 of the 101 classifications where stripping resulted in a “failure” of the SPLP algorithm to produce another valid signature (LOR≧4.0) before the 4th cycle of stripping. The columns in the left portion of Table 3 with the headings “1st,” “2nd,” “3rd,” and “4th” list the LOR for the best signature defined at each cycle. All 62 classification questions produced a valid gene signature at the first cycle, but only classifications 1-33 produced a valid second signature, and only classifications 1-9 produced a valid third signature. None of the 62 produced a valid fourth signature using the SPLP algorithm.
The Table 3 column labeled “sufficient set” lists the number of genes in the first and therefore “best” valid signature. The column labeled “necessary set” lists the number of genes in the union of the sufficient signatures identified each cycle with LOR≧4.00.
For the signatures 34 to 62, where failure occurred at the second cycle of computation, the necessary set is identical to the sufficient set. For signatures 10 to 33, where failure occurred at the third cycle of computation, the necessary sets correspond to the union of the genes present in the 1st and 2nd cycle. For the remaining 9 of the 62 signatures, the necessary set is the union of the 1st, 2nd and 3rd cycle genes as those signatures failed at the 4th cycle.
Table 4 lists the specific 79 genes of the monoamine re-uptake (SERT) inhibitor signature (i.e., classification 1 from Table 2 above) after the first cycle. Each of the 79 genes is listed with its corresponding weight. A bias of 1.69 was used in deriving the weights.
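A signature made of gene weights and a bias can be applied to an expression profile as a linear classifier. The sketch below is an illustration under stated assumptions: the sign convention (bias subtracted from the weighted sum, positive score meaning a match), the function name, and the example gene names and values are hypothetical and are not taken from Table 4.

```python
def scalar_product(weights, log_ratios, bias):
    """Weighted sum of log10 ratios over the signature genes, minus the bias.

    Genes missing from the profile contribute zero. The subtraction of the
    bias is an assumed convention.
    """
    total = sum(w * log_ratios.get(gene, 0.0) for gene, w in weights.items())
    return total - bias

# Hypothetical two-gene signature and profile, for illustration only.
weights = {"geneA": 2.0, "geneB": -1.0}
profile = {"geneA": 1.5, "geneB": 0.5, "geneC": 9.0}
score = scalar_product(weights, profile, bias=1.69)
is_match = score > 0
```

Only genes in the signature affect the score, which is why a device built on the signature needs to measure only those genes.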
Table 5 lists the 311 genes in the necessary set of the monoamine re-uptake (SERT) inhibitor signature derived according to the stripping method described above. In performing the stripping both the first and second LOR threshold value were set at greater than or equal to 4.0. The necessary set represents the union of the genes in the signatures derived in the 1st, 2nd, and 3rd stripping cycles shown above in Table 3.
Table 6 lists the remaining 39 of the 101 liver-based chemogenomic classifications where stripping did not result in failure of the SPLP algorithm to identify a valid signature even after 4 cycles. As in Table 3, the column labeled “sufficient set” lists the number of genes in the initial “best” sufficient signature. The column labeled “necessary set” lists the number of genes in the union of sufficient signatures identified at each of the four cycles. Because all of the 39 classifications produced a valid signature even after 4 cycles, the number in the “necessary set” column represents the minimum number in the necessary set for that classification question.
The results depicted in Table 3 indicate that for many gene expression based signatures (e.g., 62 out of 101), 1-3 valid non-overlapping gene signatures may be generated and consequently, the necessary set is just 2-3 times larger than the sufficient set of variables. The results shown in Table 6, however, demonstrate that a substantial number of classification questions generate a large number of non-overlapping valid signatures. In those cases, the necessary set must be on average at least four-fold larger than the best sufficient set.
In order to confirm these results and to determine the size of the necessary set for some of the more degenerate classification tasks, one classification question that failed at the 2nd cycle (NSAID, cox2/1,coxib like) and three classification questions that did not fail even up to the 4th cycle (HMG CoA Reductase, Bile Duct Hyperplasia, PPARα) were analyzed in greater depth. Specifically, the procedure outlined above was repeated but the algorithm was allowed to proceed until all LOR drop below 4.0.
As shown by the plot depicted in
This example illustrates how the necessary set of genes for a classification question may be functionally characterized by randomly supplementing and thereby restoring the ability of a depleted dataset to generate signatures above an average LOR. In addition to demonstrating the power of the information rich genes in a necessary set, this example illustrates a system for describing any necessary set of genes in terms of its performance parameters.
As described in Example 2, a necessary set of 311 genes (see Table 5) for the SERT inhibitor classification question was generated via the stripping method. In the process, a corresponding fully depleted set of 8254 genes (i.e., the full dataset of 8565 genes minus 311 genes) was also generated. The fully depleted set of 8254 genes was not able to generate a SERT inhibitor signature capable of performing with a LOR greater than or equal to 4.00.
A further 311 genes were randomly removed from the fully depleted set. Then a randomly selected set including 5, 10, 20, 40 or 80% of the genes from either: (a) the necessary set; or (b) the set of 311 randomly removed from the fully depleted set; was added back to the depleted set minus 311. The resulting “supplemented depleted” set was then used to generate a SERT inhibitor signature, and the performance of this signature was cross-validated. This process was repeated 50 times each for the depleted set supplemented with some percentage of genes from the necessary set and for the depleted set supplemented with the random 311 genes removed from the original depleted set. Fifty cross-validated SERT inhibitor signatures were obtained for each of the various percentages of depleted set supplementation. Average LOR values were calculated based on the 50 signatures generated in each case.
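The supplementation experiment can be sketched as a small resampling loop. This is a minimal sketch under assumptions: `evaluate` is a hypothetical callable standing in for signature generation plus cross-validation (it receives a gene set and returns a LOR), and the function and parameter names are introduced here for illustration.

```python
import random

def supplementation_curve(depleted, donor, evaluate,
                          fractions=(0.05, 0.10, 0.20, 0.40, 0.80),
                          repeats=50, seed=0):
    """Average LOR after adding back random fractions of `donor` genes.

    `donor` is either the necessary set or the control set of genes that
    were randomly removed from the depleted set.
    """
    rng = random.Random(seed)
    donor = sorted(donor)          # fixed order so sampling is reproducible
    base = set(depleted)
    curve = {}
    for fraction in fractions:
        n_added = round(fraction * len(donor))
        lors = []
        for _ in range(repeats):
            added = rng.sample(donor, n_added)
            lors.append(evaluate(base | set(added)))
        curve[fraction] = sum(lors) / repeats
    return curve
```

Running the function once with the necessary set as `donor` and once with the randomly removed control genes gives the two curves that the experiment compares.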
The power of the information rich genes in the necessary set was demonstrated by the results tabulated in
The above shows how supplementation with necessary set genes “revives” a fully depleted set. This ability is a common characteristic of any necessary set. This functional characteristic may be quantified with a plot of avg. LOR versus the percentage of random genes used to supplement the depleted set. As shown by the plot in
This example illustrates how the stripping method of Example 2 may be used to carry out a functional analysis of genes within the non-overlapping sufficient signature of the PPARα necessary set.
All of the valid classifiers for a given classification question must by definition overlap with the necessary gene set as defined herein. This is a direct consequence of the fact that the fully depleted set (the remaining genes after the last successful cycle of stripping) cannot produce a valid classifier. It should be informative to submit the necessary set to functional analysis because this gene set constitutes all the genes that in some combination can yield a valid classifier for a specific classification question.
Clustering Analysis of First Five Sufficient Sets
A preliminary analysis was performed of the 317 genes identified in the first 5 cycles of the PPARα signature stripping procedure. Starting with a table (genes are rows and compound treatments are columns) of gene expression log ratios, a table of the weighted expression (also referred to as the gene's “impact”) was produced where each line, corresponding to a gene, was multiplied by its weight in the corresponding signature. The number of columns in the table was then reduced by generating, for each drug, a single column for the maximum weighted expression (impact) achieved by that drug under any treatment conditions. Most drugs were tested at two doses and four time points. This procedure thus reduces the number of columns by a factor of eight.
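The construction of the weighted table can be sketched as follows. The nested-dictionary layout, the function name, and the argument names are assumptions made for illustration; the source describes the operation only in terms of table rows and columns.

```python
def max_impact_per_drug(log_ratios, weights, treatment_drug):
    """Collapse a gene-by-treatment table into a gene-by-drug impact table.

    log_ratios:     {gene: {treatment: log10 ratio}}
    weights:        {gene: weight of the gene in its signature}
    treatment_drug: {treatment: drug}, mapping each dose/time column to a drug
    Returns {gene: {drug: maximum weighted expression over its treatments}}.
    """
    table = {}
    for gene, row in log_ratios.items():
        weight = weights[gene]
        per_drug = {}
        for treatment, value in row.items():
            drug = treatment_drug[treatment]
            impact = weight * value  # weighted expression ("impact")
            if drug not in per_drug or impact > per_drug[drug]:
                per_drug[drug] = impact
        table[gene] = per_drug
    return table
```

A drug tested at two doses and four time points contributes eight treatment columns and ends up as one column in the result.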
The weighted table was clustered using UPGMA, a standard algorithm available through Spotfire DecisionSite™ to produce the image depicted in
PPARα agonists induce FABO genes (see e.g., Kersten, S., B. Desvergne, and W. Wahli, “Roles of PPARs in health and disease,” Nature 405: 421-424 (2000)), and FABO genes are used as reward genes in the initial signature run (see e.g., Natsoulis et al. 2004, Gen. Res.). This result suggests that after five cycles of stripping the algorithm keeps replacing the eliminated FABO reward genes with other FABO genes to produce a valid classifier. The rightmost column of
Non-Overlapping Signatures Can Be Used to Confirm Signature Hits
The stripping procedure described above may be used to confirm signature hits. For example, it was previously observed that an unknown compound (“compound X”) had a positive scalar product when analyzed against the PPARα signature; however, the scalar product was near that of the weakest of the known PPARα agonists, clofibric acid. In this situation, the question arises whether compound X is a “false” positive hit. For example, the apparent match of compound X to the PPARα signature may have been the result of an artifact on the expression microarray that escaped quality control. Given that each successive signature obtained by stripping is composed of a different set of genes (or at least a different set of probes on the array), these independently derived signatures may be used to confirm the match of an unknown to a signature.
To illustrate this application, the PPARα label set was modified. Originally, the unmodified labels for the PPARα signature were set such that all known PPARα agonists (42 treatments corresponding to 8 compounds) were labeled as “+1” and all treatments (˜1600) with other drugs (˜310) were labeled as “−1”. This PPARα label set was modified as follows: 10 randomly chosen non-PPARα compounds were set aside and not used in the generation of a new PPARα signature. These set aside compound treatment experiments were labeled “−2” to distinguish them from the unknown compound treatment, which was labeled as “0”. Neither the “0” labeled nor the “−2” labeled compounds take part in the signature generation. The new PPARα signature was trained for the 8 known PPARα compounds (labeled “+1”) against the other 300 non-PPARα compounds (labeled as “−1”). The maximum scalar product achieved under any treatment condition was calculated for each compound and for each of the 5 cycles of stripping. As shown by the results tabulated in
GO Analysis of PPARα Gene Sets
The complete results for the PPARα signature show that 40 cycles of stripping, involving 5706 genes, were needed to define the necessary set for this signature. A repeat of the analysis described in
The hypergeometric formula was used to assess the significance of the enriched GO terms. The most significantly enriched GO term in the 234 reward genes is unsurprisingly FABO and several other terms related to lipid metabolism. All metabolism genes were subtracted from the set of 234 reward genes and the remaining set was submitted again to the same analysis. The most significant term in this second analysis was “transport.” A third round of analysis revealed “adhesion” as the most significant term. No other significant terms were detected after subtracting adhesion related genes.
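The hypergeometric test for enrichment of a GO term can be sketched with the standard library alone. The function and parameter names below are illustrative assumptions; the text does not state which counts were used as the background population.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population: number of genes in the background
    annotated:  number of background genes carrying the GO term
    sample:     number of genes in the set being tested (e.g., reward genes)
    hits:       number of genes in that set carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

A small value indicates that the term occurs in the tested set more often than expected from random draws. Removing the genes of the top term and repeating the test on the remainder corresponds to the successive rounds of analysis described above.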
In order to determine whether genes belonging to these three GO terms are used successively the enrichment in each of the three terms was plotted as a function of the cycles (referred to in
Identification of an Alternate Pathway Correlation for PPARα Agonists.
The fact that adhesion and transport genes may be used to classify the effect induced by PPARα agonists indicates that these genes may be targets for PPARα related diseases. These alternate PPARα related genes are believed to be novel and unlikely to be uncovered by other functional analysis methods in large part because of the predominant effect of the FABO genes. Uncovering alternate pathways whose gene expression is altered in a characteristic manner by PPARα agonists may have great biological significance. While the PPARα agonists are known to induce beta oxidation they are also known to induce peroxisomal proliferation, at least in rodents, and peroxisomal proliferation may be the cause of the increased liver cancers observed in rodents exposed to PPARα agonists. PPARα agonists do not cause peroxisomal proliferation in humans, yet the suspicion remains that they may still elevate the risks of liver cancer.
Thus, the present analysis reveals a plurality of distinct gene signatures, all of them sufficient to classify the effect of PPARα agonists as they meet the LOR≧4.0 threshold criterion for signature validation. By design, none of these signatures overlap by a single gene. Yet the stripping algorithm reveals that the signatures tend to use initially the induction of some of the more prominent and well recognized FABO genes while they only later use other pathways such as adhesion and transport. The signatures using predominantly adhesion molecules may be used as a marker for important side effects of PPARα agonists in rodents. The same genes or their orthologs could also form the basis of a diagnostic to detect early signs of neoplastic transformation in liver biopsies of PPARα agonist treated humans.
Functional Analysis of the Non-Overlapping Sufficient Sets Within the HMG CoA (statin) Necessary Set
A similar functional analysis of the HMG CoA Reductase (statin) signatures may be carried out according to the methods described in Example 4. The HMG CoA Reductase (statin) signatures revealed by the stripping algorithm defined a necessary gene set composed of 1771 genes. Of these 168 are reward genes. The GO analysis described above for the PPARα signature was repeated for the statin signature. The most significant GO term in the set of reward genes is “sterol metabolism.” This result is not surprising as statins are known to induce many cholesterol biosynthesis genes. Removing “metabolism,” a superset of the “sterol metabolism” genes, reveals that signal transduction genes constitute the next most significant term.
The enrichment of the three terms (sterol metabolism, metabolism and signal transduction) was graphed as function of stripping cycles (
Recently substantial effort has been devoted to the study of the multiple therapeutically beneficial effects of statin drugs. The direct effects of statins on cholesterol biosynthesis are well-known. The recognition that statins may have anti-proliferative and anti-inflammatory properties, both of which may contribute to the control of atherosclerosis, has only recently been suggested. The above-described analysis of the necessary set of genes relevant to statin classifiers provides further support for this new hypothesis.
All publications and patent applications cited in this specification are herein incorporated by reference as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference.
Although the foregoing invention has been described in some detail by way of illustration and example for clarity and understanding, it will be readily apparent to one of ordinary skill in the art in light of the teachings of this invention that certain changes and modifications may be made thereto without departing from the spirit and scope of the appended claims.
This application claims priority from U.S. Provisional Application No. 60/579,183, filed Jun. 10, 2004, which is hereby incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
60579183 | Jun 2004 | US |