The invention relates to methods and devices for evaluating drugs and other therapeutic entities. Some of the methods and uses are related directly to adverse drug effects, and others will be more widely applicable to general evaluation of pharmacology and therapeutic index.
Developing a new therapeutic drug candidate from initial concept to market sales typically requires from 12 to 15 years of research and development time, and has been estimated to require investing nearly one billion dollars. See, e.g., UBS Warburg Report, Charles River Laboratories, Feb. 28, 2003, pages 7-8; and www.fda.gov. A significant portion of these expenditures occurs during preclinical animal testing, and even more is spent on human clinical safety and efficacy testing. Pharmacology or toxicology problems which remain undetected until later stages of drug development are extremely problematic, both in terms of dollar expenditures and lost time. And the situation is even worse if toxicity problems remain undetected until after market introduction. Thus, early and accurate assessment of safety and efficacy of candidate therapeutic entities, along with proper administration and treatment methods, is indispensable to efficient development of new therapeutic entities. Pharmacology is a science directed to the study of the action of substances, typically chemicals and other entities, on biological systems. This encompasses both pharmacodynamics and pharmacokinetics. See, e.g., Berkow, et al. The Merck Manual Merck and Co.; Hardman, et al (eds. 2001) Goodman and Gilman's: The Pharmacological Basis of Therapeutics (10th Ed.) McGraw-Hill, ISBN: 0071354697; and other academic and professional school textbooks used in teaching pharmacology. The US Food and Drug Administration (FDA) is concerned with virtually all aspects of use of substances in therapeutic or diagnostic contexts. See, e.g., www.fda.gov.
A closely related area of relevance is toxicology, which addresses features and properties of substances which may have specific defined effects on the systems, typically leading to negative or undesirable effects. See, e.g., Klaassen, et al. (eds. 2001) Casarett and Doull's Toxicology: The Basic Science of Poisons (6th ed.) McGraw-Hill, ISBN: 0071347216; and Hayes (ed. 2001) Principles and Methods of Toxicology (4th Ed.) CRC Press, ISBN: 1560328142; and other academic and professional school textbooks used in teaching toxicology.
However, elucidating the pharmacology and toxicology of a substance, e.g., a therapeutic entity or potential new treatment, requires a significant amount of study to determine the optimal means and methods for use in a defined biological context. This is both a costly and laborious process, which requires enormous investment in financial and other resources. This fact is recognized by the US FDA and other national drug regulatory agencies. It has been stated that commercially available tests for safety monitoring (biomarkers) are urgently needed, and such remains the least developed area of pharmacogenomic monitoring and individualized medical treatment.
Citation of documents above and hereafter is not intended as an admission that any is pertinent prior art. All statements as to the date or representation as to the contents of documents are based on the information available to the applicant and do not constitute any admission as to the correctness of the dates or contents of the documents.
The present invention is directed to accelerating the speed of development and reducing the resource investment necessary to determine these features for directing use of such substances or treatments to appropriate biological contexts.
The present invention provides lists of biomarkers for analysis, either directly or indirectly, which affect the toxicity pathways. These may be evaluated at many levels, including genetic, genotyping, evaluation of combination pairing of diploid alleles or haplotypes, RNA expression, protein expression, functional activity, post-translational analysis or evaluation, etc. Thus, the biomarkers refer to the corresponding genetic information, RNA, protein, or other structural embodiments thereof. And the means to use these biomarkers, e.g., to evaluate status of toxicity pathways, to evaluate individual risk or susceptibility to various toxic pathways from exposure or therapeutic intervention, to generate test systems for drug development, are all provided by identifying critical and significant contributors to the pathway progression.
The invention further provides methods for detecting the state of a toxicity pathway in a primate, said method comprising evaluating the form or function of a discriminatory biomarker selected from: (a) Table 4, subset 1; (b) Table 4, subset 2; (c) Table 4, subset 3; (d) Table 3A or 6A, subsets 2 or 3; (e) Table 3A or 6A, subset 1; (f) Table 2A or 5A, subsets 2 or 3; and (g) Table 2A or 5A, subset 1. Specific datasets also provide various markers, individually or in various combinations. Various pluralities or combinations of those markers are important in liver or other toxicity pathways. In various embodiments: the toxicity pathway is affected in response to a therapeutic treatment, including administration of a drug or combination of therapies; the primate is a chimpanzee; the form of evaluating is determination of genetic presence of a specific allelic form or specific combination of diploid alleles of said discriminatory biomarker; the form of evaluating is expression at a nucleic acid or protein level, including allelic diploid combinations of said discriminatory biomarker; the form of evaluating is a protein evaluation, including an immunoassay, modification, quantitation, mass spectroscopy, NMR, imaging, or characteristic temporal pattern determination; the form of evaluation is determination of functional activity of said discriminatory biomarker, including a detectable substrate or product of an enzymatic activity affected by said biomarker; the form of evaluation is expression or functional localization of said discriminatory biomarker in said primate, including imaging or localization; the evaluating is from a blood, hair, skin, saliva, or accessible body fluid sample or part; the evaluation includes a plurality of discriminatory biomarkers in said subsets, including biomarkers from a plurality of different subsets, including from Tables 2A/B, 3A/B, 4, 5A/B, or 6A/B; or the evaluation is a non-invasive method, including an imaging agent or detectable or labeled compound.
Based upon these identified biomarkers, the invention provides a label, diagnostic reagent, or diagnostic means directed to the identified discriminatory biomarker(s); and various kits comprising such and instructions or devices for using such and/or interpreting the results therefrom.
In preferred embodiments, the kit: (i) evaluates a multiplicity of biomarkers from Table 4; (ii) is designed to evaluate or distinguish between a plurality of defined liver toxicity pathways; (iii) is designed to further evaluate toxicity pathways other than those in liver; or (iv) is designed to evaluate a status of a toxicity pathway induced by a therapy or drug; or the diagnostic reagent or means: (i) evaluates presence or absence of specific alleles corresponding to said discriminatory biomarker; (ii) evaluates presence or absence of specific diploid combinations of alleles or haplotypes corresponding to said discriminatory biomarker; (iii) evaluates a plurality of said discriminatory biomarkers; (iv) evaluates said discriminatory biomarker over multiple time points; or (v) evaluates at least one other marker or feature.
The invention also provides test systems for chemical or biologic compounds, to screen or evaluate the impact on toxicity or other pathways affected by said compounds. In preferred embodiments, the test system: (a) incorporates a plurality of the identified discriminatory biomarkers; (b) incorporates a plurality of different features of said discriminatory biomarkers; (c) is designed to also evaluate status of non-liver toxicity pathways; or (d) evaluates various features of biomarkers selected from the identified discriminatory biomarkers.
A computer system is further provided which: (a) includes a file which provides listings of discriminatory biomarkers including at least one identified biomarker linked to status of toxicity pathways; (b) is capable of providing output of specific features of identified biomarkers which are indicative of status of toxicity pathways in particular patient subclasses; or (c) includes a file which links appropriate features of appropriate identified biomarkers, in addition to appropriate features of biomarkers for different pathways of toxicity in muscle, neurological, or bone tissue.
In other embodiments, the present invention provides methods of correlating the state of a toxicity pathway to a combination of diploid haplotypes present in a biological system. In certain embodiments, the toxicity pathway is: expressed significantly in liver, muscle, neurological, or bone marrow; expressed primarily in the GI tract, kidney, or skin; induced by a therapeutic treatment; or is induced by administration of one or a combination of drugs; or the combination of diploid haplotypes: represent at least 60% of the allelic combinations found in the US, Western Europe, or Japanese national populations; represent at least 15 different genes; represent at least 7 non-contiguous haplotype blocks; represent at least 4 different non-Y chromosomes; span at least 100 centimorgans; include a plurality which are derived from a vertebrate; are evaluated by characterization of protein features, e.g., by ELISA; or include some haplotypes from a primate; or the biological system is: a soluble test system; a cell line; an organ system; or an animal; or the correlating is: performed on a computer, which collates data to generate a file of particular identified combinations of alleles which exhibit defined categories of risk from said status of said toxicity pathway; or used to develop a set of combination diploid haplotypes which are correlated and validated to be incorporated into a diagnostic product, including one useful to predict toxicity pathway status in a subject.
The invention further provides methods of identifying additional relevant genes as candidate test targets for toxicity pathway evaluation, by taking a first list of candidate targets and identifying a second list of additional candidate targets: (1) which in an interaction database have been reported to interact physically with said targets of list 1, including a physical interaction or 2-hybrid physical interaction; (2) which have been commonly referred to in a reference with a target of list 1, said reference being in the abstract of a paper contained in a literature database; (3) whose gene expression profiles match the expression profiles of those members of list 1 in similar tissues; (4) which have been co-localized in expression analyses in similar tissues; or (5) which are closely located physically on a chromosome. In various embodiments, the toxicity pathway is: expressed significantly in liver, muscle, neurological, or bone marrow; expressed primarily in the GI tract, kidney, or skin; is induced by a therapeutic treatment; or is induced by administration of one or a combination of drugs; or the first list of candidate targets is derived from some screening methodology, including SNP analysis, gene expression profiling, post-translational modification analysis, and mass spectroscopy; or the second list: contains fewer than three times as many candidates as list 1; contains at least 20 candidate targets; contains at least 20% metabolic enzymes or transporters; is screened to validate members thereof which can classify status of said toxicity pathway into categories of risk; or the interaction database: includes data from the NCBI or PubMed databases; comprises at least 10,000 reports of physical interactions; uses manual collation, gene symbol designation, and/or word term matching; or the literature database: comprises at least 200,000 documents; contains complete abstracts of at least 100 journals since 1990; contains complete abstracts of at least 1000 journals since 1970; contains at least 20 thousand document abstracts; contains at least 500 thousand document abstracts; is available from Ingenuity or GeneGo; includes the NCBI and/or PubMed literature databases; or the gene expression profiles: match in one or more of liver, muscle, brain, bone, GI tract, kidney, skin, or oral mucosa; are in the same organ and physiology as those in list 1; are in a primate; or the additional candidates exhibit at least 2 of the criteria for inclusion into said second list; or the identifying is: performed on a computer which maintains the list 2 of candidate genes, which are subjected to validation to develop a diagnostic product therefrom; or followed by validation of relevance of candidates as classifier biomarkers to toxicological status, which may lead to efforts to develop a diagnostic product.
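The extension from list 1 to list 2 can be illustrated with a minimal sketch covering only criteria (1) and (2), physical interaction and co-mention in an abstract. The function name `extend_candidates`, the `min_criteria` parameter, the in-memory data layout, and the gene names are all hypothetical placeholders; the text does not specify any particular implementation or database format.

```python
from collections import Counter


def extend_candidates(list1, interactions, abstracts, min_criteria=1):
    """Return genes outside list 1 that satisfy at least `min_criteria` linkage criteria."""
    seeds = set(list1)
    hits = Counter()  # gene -> number of distinct criteria satisfied

    # Criterion (1): reported physical interaction with a list 1 member.
    partners = set()
    for a, b in interactions:
        if a in seeds and b not in seeds:
            partners.add(b)
        elif b in seeds and a not in seeds:
            partners.add(a)
    hits.update(partners)

    # Criterion (2): mentioned in the same abstract as a list 1 member.
    co_mentioned = set()
    for genes in abstracts:
        genes = set(genes)
        if genes & seeds:
            co_mentioned |= genes - seeds
    hits.update(co_mentioned)

    return sorted(g for g, n in hits.items() if n >= min_criteria)


# Made-up placeholder data, for illustration only.
list1 = ["GENE_A", "GENE_B"]
interactions = [("GENE_A", "GENE_C"), ("GENE_B", "GENE_D"), ("GENE_E", "GENE_F")]
abstracts = [{"GENE_A", "GENE_C", "GENE_G"}, {"GENE_E", "GENE_F"}]

list2 = extend_candidates(list1, interactions, abstracts)
list2_strict = extend_candidates(list1, interactions, abstracts, min_criteria=2)
print(list2)         # candidates meeting any one criterion
print(list2_strict)  # candidates meeting both criteria
```

Setting `min_criteria=2` corresponds to the embodiment in which additional candidates exhibit at least 2 of the criteria for inclusion in the second list.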
Other methods are provided herein of correlating status of a toxicity pathway to temporal patterns of features of classifier biomarkers determined at multiple time points, e.g., within a single individual, wherein said feature is selected from: (1) RNA expression of selected genes; (2) protein expression of selected genes; (3) post translational features of selected genes; (4) metabolic conversions of reactants or products of selected genes; (5) cellular, organ, or tissue localization of a biological product or tracer (including nucleic acid, protein, carbohydrate, phosphorylation, label, or toxin); or (6) features of acute liver metabolic enzymes or transporters. Preferred embodiments include, e.g., those where: the toxicity pathway is: expressed significantly in liver, muscle, neurological, or bone marrow; expressed primarily in the GI tract, kidney, or skin; is induced by a therapeutic treatment; or is induced by administration of one or a combination of drugs; or where the temporal pattern is an increase, decrease, stable then change, increase then decrease, or decrease then increase; or the classifier biomarker is evaluated in a whole organism, including a primate; or the time points span: hours to weeks to months; from before to after one or more toxicity symptom is manifested; or the classifier biomarker is assayed by an imaging agent, a test reagent, or detectable reactant or product; or the correlating is: performed on a computer, which collates data to generate a file of particular identified temporal patterns of features which define categories of risk from said status of said toxicity pathway; or used to develop a set of identified temporal patterns of features which are correlated and validated to be incorporated into a diagnostic product, including one useful to predict toxicity pathway status in a subject.
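As a rough illustration of the temporal patterns named above (increase, decrease, stable then change, increase then decrease, decrease then increase), the following sketch labels a series of measurements taken at successive time points. The function names, the `tolerance` parameter, and the extra labels "stable" and "other" are assumptions added for completeness and do not come from the text.

```python
def collapse(signs):
    """Merge consecutive duplicates, e.g. [0, 0, 1, 1] -> [0, 1]."""
    return [s for i, s in enumerate(signs) if i == 0 or s != signs[i - 1]]


def classify_pattern(values, tolerance=0.05):
    """Label a time series of one biomarker feature with a temporal pattern."""
    # Reduce each consecutive step to +1 (rise), -1 (fall), or 0 (within tolerance).
    steps = []
    for prev, cur in zip(values, values[1:]):
        delta = cur - prev
        if abs(delta) <= tolerance:
            steps.append(0)
        else:
            steps.append(1 if delta > 0 else -1)

    runs = collapse(steps)
    # A flat opening followed by any movement is "stable then change".
    if len(runs) > 1 and runs[0] == 0:
        return "stable then change"

    # Otherwise ignore flat stretches and look at the order of rises and falls.
    trend = tuple(collapse([s for s in runs if s != 0]))
    labels = {
        (): "stable",  # assumed label, not listed in the text
        (1,): "increase",
        (-1,): "decrease",
        (1, -1): "increase then decrease",
        (-1, 1): "decrease then increase",
    }
    return labels.get(trend, "other")  # "other" is an assumed fallback


# Made-up measurements, for illustration only.
print(classify_pattern([1.0, 1.5, 2.0]))
print(classify_pattern([1.0, 1.0, 1.02, 2.0]))
print(classify_pattern([2.0, 1.0, 3.0]))
```

The tolerance decides how small a change counts as stable; in practice it would depend on assay noise for the feature being measured.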
Yet other methods are provided correlating status of a toxicity pathway to classifier biomarkers, wherein: (1) said markers are monitored in a genetically homogeneous primate population with substantial medical records allowing generation or testing of correlation of said status of toxicity pathway with said biomarkers in said population; or (2) a sufficiently large population of primates with access to: (i) primate biological samples; or (ii) sufficient diagnostic data within the record, such allowing selection of a subset of said population with sufficient numbers to evaluate from said subset correlation of non-therapy related toxicity to classifier biomarkers. Various particular embodiments include where the toxicity pathway is: expressed significantly in liver, muscle, neurological, or bone marrow; expressed primarily in the GI tract, kidney, or skin; not induced by a therapeutic treatment; or induced by administration of a combination of drugs; or the classifier biomarkers: include a plurality of both metabolic enzymes and transporters; or number at least 10 different classifier biomarkers; or the genetically homogeneous population: has accessible medical records and informed consent for at least 30 thousand individuals; is located in the US; is from Finland, Iceland, Sardinia, or Estonia; has essentially full medical records for individuals for at least 5 years previous to testing of biomarkers; has a LD of less than 0.80 on a median intermarker distance of 4.5 KB; or has highly conserved mitochondrial DNA sequence; or the samples are archived or banked; or the subset has phenotypic homogeneity by selection criteria; or the correlating is: performed on a computer, which collates data to generate a file of particular genotypes or other features which define categories of risk from said status of said toxicity pathway; or used to develop a set of genotypes or features which are correlated and validated to be incorporated into a diagnostic product, including one useful to predict toxicity pathway status in a subject.
The invention further provides combinations of the methods, e.g., studying biology of a mammal, comprising combining a method of correlation analysis between phenotype and a diploid haplotype with extending a list of functional candidate entities from list 1 to list 2 by system biology linkage, which may include linkage by physical interaction and/or literature connection by common reference in a published abstract. Exemplary embodiments include wherein (i) a list 1 is extended to a list 2, and said phenotype is further correlated with diploid haplotype combinations corresponding to at least one functional candidate of list 2; (ii) a diploid analysis is performed, and the phenotype is further correlated with another feature of a functional candidate of list 2 resulting from extending of a list 1 of candidates evaluated in said diploid analysis; or (iii) a diploid analysis is performed, and the phenotype is further correlated with another feature of a functional candidate of list 2 resulting from correlation to a list 1 candidate resulting from analysis of a different parameter.
Additional methods are provided, e.g., which combine methods of correlation analysis between phenotype and a combination of diploid haplotypes with evaluating multiple time point features, which may include haploid or combination diploid analysis. Certain embodiments include, e.g., wherein correlation between said biology in said mammal is with: (i) at least one diploid haplotype combination and at least one multiple time point feature; or (ii) a plurality of diploid haplotype combinations and multiple time point features.
The invention further provides combining methods of correlation analysis between phenotype and a plurality of non-adjacent haplotypes with use of a “homogeneous” primate population, which may include genetically homogeneous or phenotypically selected “subclasses” from a larger collection by medical record or other selection criteria. In some embodiments, the population is a genetically homogeneous population; or the biology is not a response to treatment.
Other methods extend a list of functional candidates from list 1 to list 2 by system biology linkage, which may include linkage by physical interaction and/or literature connection by common reference in a published abstract, with evaluating multiple time point features, which may include haploid or combination diploid tracking. Preferred embodiments include where: (i) the correlation between said biology in said mammal is with at least one parameter of a list 2 candidate and one multiple time point feature; (ii) the correlation is with a multiple time point feature of a list 2 candidate; or (iii) the correlation is with a feature of a list 2 candidate resulting from a multiple time point analysis.
Yet another embodiment of the invention results from combining methods to extend a list of functional candidates from list 1 to list 2 by system biology linkage, which may include linkage by physical interaction and/or literature connection by common reference in a published abstract, with methods which use a “homogeneous” primate population, which may include genetically homogeneous or phenotypically selected “subclasses” from a larger collection by medical record or other selection criteria. In preferred forms, this may include where: (i) a correlation of said biology in said mammal to a list 1 candidate leads to said list 2 candidate, which is tested for validation in said primate population; or (ii) a hypothesis generated from said population directed to a list 1 candidate is tested by evaluating a list 2 candidate.
Further methods result from combining methods to evaluate multiple time point features, which may include haploid or combination diploid analysis, with use of a “homogeneous” primate population, which may include genetically homogeneous or phenotypically selected “subclasses” from a larger collection by medical record or other selection criteria. Among such embodiments is where the biology is tested in said homogeneous primate population for correlation with multiple time point features.
Also provided are methods using analysis of genetic makeup of a target individual animal to predict therapeutic outcome from administration of a compound or treatment to the target individual, the method involving: establishing correlation of therapeutic outcomes to various combinations of haplotypes or alleles possessed by various individual animals; determining the combination of haplotypes or alleles possessed by the target individual; and applying the correlation from the combination of haplotypes or alleles to predict the therapeutic outcomes. Alternatively, the methods comprise determining the combination of alleles possessed by the target individual (and previously established as correlated with the therapeutic outcome); and applying the correlation from the combination of alleles to predict the therapeutic outcomes.
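The three steps of this method (establish a correlation, determine the target's combination, apply the correlation) can be sketched as a simple frequency table keyed by diploid haplotype pair. This is an illustrative assumption, not the method described in the text, which leaves the statistical approach open; the function names, haplotype labels, and records below are hypothetical.

```python
from collections import Counter, defaultdict


def diplotype(h1, h2):
    """Order-independent key for a pair of haplotypes."""
    return tuple(sorted((h1, h2)))


def build_correlation(records):
    """Step 1: count observed outcomes for each haplotype combination."""
    table = defaultdict(Counter)
    for h1, h2, outcome in records:
        table[diplotype(h1, h2)][outcome] += 1
    return table


def predict(table, h1, h2):
    """Steps 2 and 3: look up the target's combination and return
    (most frequent outcome, observed fraction), or None if unseen."""
    counts = table.get(diplotype(h1, h2))
    if not counts:
        return None
    outcome, n = counts.most_common(1)[0]
    return outcome, n / sum(counts.values())


# Made-up records: (haplotype 1, haplotype 2, recorded outcome).
records = [
    ("H1", "H1", "no adverse event"),
    ("H1", "H1", "no adverse event"),
    ("H1", "H2", "adverse event"),
    ("H2", "H1", "adverse event"),
    ("H1", "H2", "no adverse event"),
    ("H2", "H2", "adverse event"),
]

table = build_correlation(records)
print(predict(table, "H2", "H1"))
print(predict(table, "H3", "H3"))
```

Sorting the pair makes the lookup independent of which chromosome carries which haplotype, which matches the emphasis on diploid combinations over individual alleles.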
In certain embodiments of the methods, the analysis of the genetic makeup is qualitative or quantitative determination of common haplotypes or alleles across a population of which the target individual is a member, including analysis of haplotype or allele dosage; the analysis is by nucleic acid (DNA, RNA) sequence or polymorphism analysis, (DNA, RNA) hybridization, protein analysis, or enzyme activity analysis; the genetic makeup includes: duplication or multiple copies of an allele or haplotype, chromosome duplication, amplification of a genetic locus, or multiple related alleles of at least 90% amino acid sequence identity over a length of at least 35 amino acids; the target individual is: a primate, rodent, or canine; a companion, work, or show animal; a quadruped, biped, or aquatic animal; a vertebrate, including one with an exoskeleton; or heterozygous or homozygous with respect to the haplotype or allele; or the therapeutic outcome is: a drug adverse event; no drug adverse event; drug efficacy; or no drug efficacy. Yet other methods include those where the administration is: one or more purified chemical entity or compound; topical, oral, parenteral, inhaled, administered to the eye, an implant, or other means; or repeated; or the correlation: is with a coefficient greater than 0.6; has been established with a statistical reliability measure; has been established by testing of a drug adverse event population of greater than 100 adverse events; is combined with another feature from a medical record of the target individual or with another diagnostic result; or is made in a homogeneous founder population of at least 20K individuals; or the allele is in a: cytochrome P450 locus; transporter/pump locus; or “drug metabolizing enzyme” locus.
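The embodiment calling for a correlation coefficient greater than 0.6 does not say which coefficient is meant. As one possible reading, the sketch below computes a Pearson coefficient between allele dosage (0, 1, or 2 copies) and a binary adverse event indicator. The choice of Pearson, the function name, and the data are assumptions made purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data: copies of a hypothetical allele per individual,
# and whether an adverse event was recorded (1) or not (0).
dosage = [0, 0, 0, 1, 1, 1, 2, 2]
adverse = [0, 0, 0, 0, 1, 0, 1, 1]

r = pearson(dosage, adverse)
meets_threshold = r > 0.6
print(round(r, 3), meets_threshold)
```

A sample this small says nothing about statistical reliability; the text separately lists a reliability measure and an adverse event population of greater than 100 events as further embodiments.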
Further methods include those comprising communicating to a recipient a result of the method, wherein: the communication is: written, oral, coded, digital, analog, or passes through US legal jurisdiction; or where the recipient is: within US legal jurisdiction; a medical patient or veterinary owner; a health care professional, medical or veterinary; a regulatory agency or drug development organization; or a health care insurer or auditor.
In another embodiment, the invention provides diagnostic devices comprising means to determine a substantially full complement of haplotypes or alleles of a biomarker possessed by a target diploid individual, the means providing for identifying what haplotypes or alleles are present in the target individual, and evaluating biological function of the product of those haplotypes or alleles. Often, the devices will be ones wherein: the means: simultaneously determine both what haplotypes are present or absent, and what biological function corresponds to the haplotypes; determine the complete protein sequence encoded by each haplotype; are automated and provide a readout result within about three hours; or include dynamic features and/or multiple time points of evaluating; the complement of haplotypes includes: a heterozygous pair of haplotypes; a gene dosage variation different from a chromosomal pair, including a chromosome duplication resulting in triploidy of the chromosome; a plurality of closely related sequences which exhibit both high sequence identity and overlapping biological function (multiple homologs, e.g., where complement of related enzymes affect selectivity/specificity/kinetics of reaction, or transporters); alleles of enzymatic turnover numbers which differ by at least 30%; or surrogate markers which are accepted as diagnostic for a defined phenotype. 
Other embodiments may be devices where: the biomarker: comprises a plurality of a cytochrome, enzyme, transporter, and/or structural protein; or is represented by at least 5 different alleles or non-contiguous haplotypes found in a population including the individual; or the diploid individual is: a mammal, including a primate, rodent, feline, or canine; a companion, work, or show animal; or an experimental research animal, including a nematode, water flea, insect, or invertebrate; or the evaluating biological function is: by proteomic or metabolomic analysis; or capable of distinguishing different types of pharmacological dose response curves, including an increasing or decreasing, U shaped, bell shaped, or hormetic situation.
Methods using such devices are provided, e.g., methods comprising predicting outcome from a defined treatment of a target individual by evaluating the complement of alleles possessed by the individual using the described device; or where: the outcome is therapeutic efficacy, therapeutic safety, or risk of an adverse reaction; or results of the evaluating are communicated to a recipient wherein: the communication is written, oral, coded, digital, analog, or passes through US legal jurisdiction; or the recipient is: within US legal jurisdiction; a medical patient or veterinary owner; a health care professional, medical or veterinary; a regulatory agency or drug development organization; or a health care insurer or auditor.
Another alternative embodiment of the invention provides methods of identifying biomarkers useful for predicting response of an individual to therapeutic treatment, comprising: collecting a homogeneous population of individuals having received the treatment with a recorded result from the treatment; evaluating genetic markers in a plurality of the individuals in the population to identify biomarkers which correlate with specific recorded result from the treatment; and correlating the genetic markers with the specific recorded results to identify biomarkers which are predictive of the result. Various methods encompassed include, e.g., those where: the identifying: allows development of a registered diagnostic test or device which evaluates the biomarker, e.g., to predict an adverse drug response; is communicated by written, oral, coded, digital, analog, or means passing through US legal jurisdiction; or is communicated to a recipient who is within US legal jurisdiction; a medical patient or veterinary owner; a health care professional, medical or veterinary; a regulatory agency or drug development organization; or a health care insurer or auditor; or the biomarker includes: a dynamic or temporal component in evaluation; multiple endpoint, concentration, temperature, or other analyses; a genetic analysis; or a proteomic and/or metabolomic analysis; or the prediction has an accuracy of at least 95% over a defined population; or the response is a pharmacological or toxicological response.
Other methods encompassed include some, e.g., where the individual is: a mammal, including a primate, equine, bovine, porcine, canine, feline, rodent, or quadruped; a companion, work, or show animal; an experimental research animal, including a nematode, water flea, insect, or invertebrate; or a plant, fungus, protozoa, or prokaryote; or the treatment is administering one or more therapeutic compounds in a predetermined methodology; or the population: comprises at least 2 million individual primates; has a homogeneity exhibiting fewer than about 300K SNPs of frequency occurrences of at least 1% within the population; has medical records accessibility for at least 30% of the population going back at least 3-5 years; and/or has an adverse drug reaction reporting system.
The invention further provides such methods wherein the genetic markers allow prediction of other biomarkers from pathways correlated to the result, and the other biomarkers from the pathways may be tested by perturbations to optimize or identify what perturbations affect correlation of the biomarkers to the result, thereby identifying high correlation biomarkers for the result. This shall include methods wherein: the perturbations: are in a gene sequence or quantity (regulation); protein sequence, modification, or quantity; substrates or analogs thereof (including inhibitors or regulatory subunits); metabolic intermediates; time of endpoint or analyses; temperature; and/or isotopic variants; are achieved by any of gene expression modifiers (including knockout or transformants), gene suppression (e.g., using RNAi or anti-sense), use of dominant negative forms or suppressors, and activating mutants; are achieved by chemical perturbations, e.g., by varying concentrations of small molecule inhibitors, co-factors (natural or otherwise), or activators; and/or are evaluated as a function of time; and/or the high correlation biomarkers can be: incorporated into an experimental system which can be used to model effect of a therapeutic treatment back to a target individual or subsystem thereof; monitored in an individual to anticipate the timing, severity, or type of a phenotype, e.g., as a pool of surrogate markers; or diagnosed in an individual and predict efficacy or response to a therapeutic treatment, including an adverse drug response.
Further methods are provided, e.g., wherein: the experimental system comprises: a transgenic, transformed, or genetically modified cell; an identified selected genetic, developmental, or physiological variant cell; an in vitro genetic model for a disease; an organism, including a rodent, possessing features characterizing a disease or model; a cell comprising a human gene; a candidate therapeutic entity for treatment of a medical condition; a mammalian stem cell, including a cell derived from a primate; or at least one of the steps occurs outside the United States.
In a drug development context, the method may comprise use of the experimental system to evaluate or prioritize development candidates for pharmacology or toxicology, e.g., in preclinical evaluation.
The invention further provides methods using a combination of cells or systems comprising genetic or physiological variants exhibiting specific high correlation biomarkers for a phenotype, wherein the combination of cell lines is monitored to evaluate therapeutic response to a therapeutic treatment. This will include methods where: the combination of cells or systems: comprise one or more human gene, chromosome, or cell; evaluate effect of different expression levels of one or more haplotype, gene, or phenotype; make use of one or more microfluidic chips, e.g., allowing a series of chips to represent various individuals in a population; provide a model for disease, including an in vitro or in vivo model; provide a surrogate marker for a human or animal phenotype, including toxicity; or are in an intact organism; or where the monitoring evaluates multiple endpoints, a concentration/response, metabolic turnover (including substrate and/or product), a plurality of different assays, and/or multiple genetic variants; or where the phenotype: is in a primate or invertebrate; or allows prediction of therapeutic index of a therapeutic entity in a defined system or animal; or where the cells or systems represent a scope of variation of individuals across a population of the individuals; or where the therapeutic treatment is testing or screening various candidate therapeutic entities, including prioritization of candidates for product development; or where a result of the evaluation or a resulting conclusion is communicated to a recipient, wherein: the communication: is written, oral, coded, digital, or analog, or passes through US jurisdiction; or the recipient is: within US legal jurisdiction; a medical patient or veterinary owner; a health care professional, medical or veterinary; a regulatory agency or drug development organization; or a health care insurer or auditor.
VII. in vitro and in vivo Disease Models
As described above, commercially available tests for safety monitoring (biomarkers) are urgently needed. The US FDA and other groups have formed brainstorming discussion groups, such as the industry-sponsored Pharmacogenetics Working Group, to chart out new strategies and developments in this area. Similar concerns exist in foreign drug regulatory agencies. Activity in the pharmacogenomics area is great, including release of the Guidance for Industry Pharmacogenomic Data Submissions (November 2003); a draft Preliminary Concept Paper “Drug-Diagnostic Co-Development Concept Paper” (April 2005); and other papers and guidelines scheduled for release from the FDA in the near future. Coupled with a recent focus on toxicology issues of marketed drugs (Vioxx, Bextra, and others), along with slow and costly development for new drugs, pharmacogenomics has become a higher profile endeavor.
The science of drug development is quite complex. It includes, among others, various areas of study of medical sciences and particularly therapeutic effects, including the sciences of pharmacology and toxicology, along with medicinal chemistry, diagnostic principles, statistics, and validation procedures. Pharmacology is directed to the study of the properties and reactions of drugs, especially with relation to their therapeutic values. Various aspects of pharmacology include formulation, absorption, distribution, metabolism, excretion, and such. See, e.g., Evans (2004) A Handbook of Bioanalysis and Drug Metabolism CRC Press, ISBN: 0415275199; Golan, et al. (2004) Principles of Pharmacology: The Pathophysiologic Basis of Drug Therapy Lippincott Williams and Wilkins, ISBN: 0781746787; Minneman (2004) Brody's Human Pharmacology: Molecular To Clinical (4th ed.) Mosby-Year Book; ISBN: 0323032869; van de Waterbeemd, et al. (2003) Drug Bioavailability: Estimation of Solubility, Permeability, Absorption and Bioavailability (Methods and Principles in Medicinal Chemistry) Wiley-VCH, ISBN: 352730438X; Avdeef (2003) Absorption and Drug Development: Solubility, Permeability, and Charge State Wiley-Interscience, ISBN: 0471423653; Allen (2002) The Art, Science, and Technology of Pharmaceutical Compounding (2d ed.), APhA Pub., ISBN: 1582120358; Amiji and Sandmann (2002) Applied Physical Pharmacy McGraw-Hill Med., ISBN: 0071350764; Smith, et al. (2000) Pharmacokinetics and Metabolism in Drug Design (Methods and Principles in Medicinal Chemistry) Wiley-VCH, ISBN: 3527301976; Testa and Mayer (2003) Hydrolysis in Drug and Prodrug Metabolism: Chemistry, Biochemistry, and Enzymology Wiley-VCH, ISBN: 390639025X; Ansel and Stoklosa (2001) Pharmaceutical Calculations (11th ed.) Lippincott Williams and Wilkins, ISBN: 0781731720; Daniels, et al. (eds. 2001) Principles of Clinical Pharmacology Academic Press; ISBN: 0120660601; Gibson and Skett (2001) Introduction to Drug Metabolism (3d ed.) Nelson Thornes, ISBN: 0748760113; Ansel, et al. (1999) Pharmaceutical Dosage Forms and Drug Delivery Systems (7th ed.) Lippincott Williams and Wilkins, ISBN: 0683305727; Cannon (1999) Pharmacology for Chemists (ACS Professional Reference Book) Amer. Chem. Soc., ISBN: 0841235244; Woolf (1999) Handbook of Drug Metabolism Marcel Dekker, ISBN: 0824702298; and Feldman, et al (1996) Principles of Neuropsychopharmacology Sinauer Associates, ISBN: 0878931759. See also Frank and Hargreaves (2003) “Clinical biomarkers in drug discovery and development” Nature Reviews Drug Discovery 2:566-580.
Toxicology textbooks include, besides those used in standard academic or professional school courses, Moffat, et al. (2004) Clarke's Analysis of Drugs and Poisons (3d ed.) Pharmaceutical Press, ISBN: 0853694737; Hodgson (ed. 2004) A Textbook of Modern Toxicology (3d ed.) Wiley-Interscience, ISBN: 047126508X; Burczynski (2003) An Introduction to Toxicogenomics CRC Press, ISBN: 0849313341; Boelsterli (2002) Mechanistic Toxicology: The Molecular Basis of How Chemicals Disrupt Biological Targets CRC Press, ISBN: 0415284597; Rossoff (2001) Encyclopedia of Clinical Toxicology: A Comprehensive Guide Taylor and Francis Group, ISBN: 1842141015; Gad, et al. (2001) Regulatory Toxicology (2d ed.) Taylor and Francis STM, ISBN: 0415239192; Hodgson and Smart (2001) Introduction to Biochemical Toxicology (3d ed.) Wiley-Interscience, ISBN: 0471333344; Lewis (2001) Guide to Cytochromes P450: Structure and Function CRC Press, ISBN: 0748408975; Ford, et al. (eds. 2000) Clinical Toxicology Saunders, ISBN: 0721654851; Wexler, et al. (2000) Information Resources in Toxicology (3d ed.) Academic Press, ISBN: 0127447709; Wexler, et al (1998) Encyclopedia of Toxicology (3 Vol.) Acad. Pr., ISBN: 012227220X; Puga and Wallace (1998) Molecular Biology of the Toxic Response CRC Press, ISBN: 1560325925; and Sipes, et al. (eds. 1997) Comprehensive Toxicology (13 vols.) Elsevier, ISBN: 0080423019 (CD-Rom ed. ISBN: 008042306X).
Medicinal chemistry is a critical function in drug development, and is described generally, e.g., in Dingermann, et al. (2004) Molecular Biology in Medicinal Chemistry (Methods and Principles in Medicinal Chemistry) Wiley, ISBN: 3527304312; Silverman (2004) The Organic Chemistry of Drug Design and Drug Action (2d ed.) Acad. Pr., ISBN: 0126437327; Abraham (ed. 2003) Burger's Medicinal Chemistry and Drug Discovery (6 Vols. on Drug Discovery; Drug Discovery and Drug Development; Autocoids, Diagnostics, and Drugs from New Biology; Cardiovascular Agents and Endocrines; Chemotherapeutic Agents; and Nervous System Agents) Wiley-Interscience, ISBN: 0471370320; Lemke (2003) Review of Organic Functional Groups: Introduction to Medicinal Organic Chemistry (4th ed. with CD), Lippincott Williams and Wilkins, ISBN: 0781743818; Wermuth (ed. 2003) The Practice of Medicinal Chemistry (2d ed.) Academic Press; ISBN: 0127444815; Williams, et al. (2002) Foye's Principles of Medicinal Chemistry (5th ed.) Lippincott Williams and Wilkins, ISBN: 0683307371; Patrick (2001) An Introduction to Medicinal Chemistry (2d ed.) Oxford Univ. Pr., ISBN: 0198505337; Smith, et al (2000) Pharmacokinetics and Metabolism in Drug Design (Methods and Principles in Medicinal Chemistry) Wiley-VCH, ISBN: 3527301976; Dickson (1998) Medicinal Chemistry Laboratory Manual: Investigations in Biological and Pharmaceutical Chemistry CRC Press, ISBN: 0849318882; and King (1994) Medicinal Chemistry: Principles and Practice Springer-Verlag, ISBN: 0851864945.
Systems biology analyses and techniques are described, e.g., in Klipp, et al. (2005) Systems Biology in Practice: Concepts, Implementation and Application Wiley, ISBN: 3527310789; Kitano (ed. 2001) Foundations of Systems Biology MIT Press, ISBN: 0262112663; Bower and Bolouri (eds. 2001) Computational Modeling of Genetic and Biochemical Networks (Computational Molecular Biology) MIT Press, ISBN: 0262024810; Voit (2000) Computational Analysis of Biochemical Systems: A Practical Guide for Biochemists and Molecular Biologists (with CD-ROM) Cambridge Univ. Pr., ISBN: 0521785790; and other materials used in leading academic or professional school departments teaching courses in this area. Fundamental principles may include, e.g., coregulation suggesting functional relationship, and others which are based upon information theory and mathematics of complex systems, of which biology is one of the most complex. See, e.g., Abraham, et al. (2004) “High content screening applied to large-scale cell biology” Trends Biotechnol. 22: doi:10.1016/j.tibtech.2003.10.012; directed to cell biology, but systems biology indicates that cell biology affects organ physiology and system response.
Other references relevant to the subject matter of the present invention include, e.g., Cavalli-Sforza and Bodmer (1971) The Genetics of Human Populations Freeman, San Francisco; Babine, et al. (2004) Protein Crystallography in Drug Discovery (Methods and Principles in Medicinal Chemistry) Wiley, ISBN: 3527306781; Kumar, et al. (2004) Robbins and Cotran: Pathologic Basis of Disease (7th ed. with CD) Saunders Co., ISBN: 0721601871; Kubinyi, et al. (2004) Chemogenomics in Drug Discovery: A Medicinal Chemistry Perspective (Methods and Principles in Medicinal Chemistry) Wiley, ISBN: 352730987X; Block, et al. (2004) Wilson and Gisvold's Textbook of Organic Medicinal and Pharmaceutical Chemistry (11th ed.) Lippincott Williams and Wilkins, ISBN: 0781734819; Böhm, et al. (2003) Protein-Ligand Interactions: From Molecular Recognition to Drug Design (Methods and Principles in Medicinal Chemistry) Wiley, ISBN: 3527305211; Seydel, et al. (2002) Drug-Membrane Interactions: Analysis, Drug Distribution, Modeling (Methods and Principles in Medicinal Chemistry, Volume 15) Wiley-VCH, ISBN: 3527304274; Smith, et al (2000) Pharmacokinetics and Metabolism in Drug Design (Methods and Principles in Medicinal Chemistry) Wiley-VCH, ISBN: 3527301976; Wolff (ed. 1997) Therapeutic Agents, Volume 4, Burger's Medicinal Chemistry and Drug Discovery (5th ed.) Wiley-Interscience, ISBN: 0471575593; and Kennewell (1991) Comprehensive Medicinal Chemistry: General Principles Pergamon Pr., ISBN: 0080370578. Additional references of general medical relevance include, e.g., Berkow (ed.) The Merck Manual of Diagnosis and Therapy Merck and Co., Rahway, N.J.; Thorn, et al. Harrison's Principles of Internal Medicine McGraw-Hill, N.Y.; and Weatherall, et al. (eds.) Oxford Textbook of Medicine Oxford Univ. Press, Oxford.
Phenotypes are diverse, and relate, e.g., to physiological, metabolic, behavioral, health status, disease state, development, and other functional or structural characteristics of a system. Features may be as diverse as size, weight, color, function, histology evaluation, or other distinctive features of the system or parts thereof. The evaluation may be of the entire system together, or of parts thereof, e.g., function of particular organ subsystems or metabolic pathways. The phenotypes of interest herein include response to therapy, including efficacy, or toxicological response, including the standard absorption, distribution, metabolism, excretion, and negative response to an administered drug or therapy. Negative responses are often characterized as adverse drug responses (ADR). The pharmacological features often evaluate the effects or response of the system to administration of a therapy, e.g., a chemical entity. The features will often be evaluated over different organs or samples, depending upon accessibility and relevance of the samples. For example, in a lung disease context, samples generally considered relevant include blood, which may comprise cells, serum, or plasma; samples taken before and/or after therapy; biological cell samples, which may be biopsy, tumor, or tissue samples; fluid samples such as lavage or induced sputum samples; or postmortem tissue.
Expression evaluations need not be limited to single sample sites, but may evaluate comparative levels across relevant sample sources, e.g., blood and biopsy, or multiple organs, e.g., imaging of both liver and brain.
Phenotype correlation to specific genes is the subject of the science of genetics, and of the related fields of molecular biology or molecular genetics. See, e.g., Hedrick (2004) Genetics of Populations (3d ed.) Jones and Bartlett Pub., ISBN: 0763747726; Griffiths, et al. (2004) An Introduction to Genetic Analysis (8th ed.) Freeman, ISBN: 0716749394; Hartwell, et al. (2003) Genetics: From Genes to Genomes (2d ed.) McGraw-Hill, ISBN: 0072462485; Strachan and Read (2003) Human Molecular Genetics (3d ed.) Garland, ISBN: 0815341822; Lewontin, et al. (2002) Modern Genetic Analysis: Integrating Genes and Genomes (2d ed.) Freeman, ISBN: 0716743825; Klug and Cummings (2002) Concepts of Genetics (7th ed.) Prentice Hall, ISBN: 0130929980; Snustad and Simmons (2002) Principles of Genetics (3d ed.) Wiley, ISBN: 0471441805; Hartl, et al. (2002) Essential Genetics (3d ed.) Jones and Bartlett Pub., ISBN: 0763718521; Brown (2002) Genomes (2d ed.) BIOS Sci. Pub.; Nussbaum, et al. (2001) Thompson and Thompson Genetics in Medicine (6th ed.) Saunders and Company, ISBN: 0721669026; Alberts, et al. (2002) Molecular Biology of the Cell (4th ed.) Garland Pub.; Lodish, et al. (2002) Molecular Cell Biology. (4th ed.) Freeman; Haines and Pericak-Vance (eds. 1998) Approaches to Gene Mapping in Complex Human Diseases Wiley-Liss, ISBN: 0471171956; Kwok (2001) “Methods for genotyping single nucleotide polymorphisms” Ann. Rev. Genomics and Human Genetics 2:235-258; and Lin, et al. (2005) “Sequencing drug response with HapMap” The Pharmacogenomics Journal 5:149-156. Moreover, complex statistical methods are used in determining population genetics and correlations of phenotypes to specific genes.
Typically, the focus is on the correlation of phenotype to genetic elements, and typically specific alleles or other genetic haplotype markers. The correlation is measured by standard coefficients, and will typically be high, e.g., at least about 98%, 96%, 94%, 91%, 88%, 84%, 81%, 78%, 70%, 60%, 50%, 40%, etc. In many approaches, the alleles or haplotypes are represented by structural polymorphisms, e.g., which are more easily defined structurally in the form of nucleotide polymorphisms. Often the concept of Single Nucleotide Polymorphisms (SNPs) is substituted for entities which are conceived as genes, whether structural (encoding) or less well defined genetic features. The determination of genetic makeup, or the genetic details of an individual, is useful for such analyses. See, e.g., Lettieri (2006) “Recent applications of DNA microarray technology to toxicology and ecotoxicology” Environ. Health Perspect. 114:4-9; Wei, et al. (2005) “Data-driven analysis approach for biomarker discovery using molecular-profiling technologies” Biomarkers 10:153-172 (PMID: 16076730); Roelofsen, et al. (2004) “Proteomic analyzes of copper metabolism in an in vitro model of Wilson disease using surface enhanced laser desorption/ionization-time of flight-mass spectrometry” J. Cell Biochem. 93:732-740 (PMID: 15660417); Zhao, et al. (2004) “Identification of differentially expressed genes with multivariate outlier analysis” J. Biopharm. Stat. 14:629-646 (PMID: 15468756); Irwin, et al. (2004) “Application of toxicogenomics to toxicology: basic concepts in the analysis of microarray data” Toxicol. Pathol. 32 Suppl 1:72-83 (review; PMID: 15209406); Sarrif, et al. (2005) “Toxicogenomics in genetic toxicology and hazard determination: introduction and overview” Mutat. Res. 575:1-3 (PMID: 15924883); Sarrif, et al. (2005) “Toxicogenomics in genetic toxicology and hazard determination—concluding remarks” Mutat. Res. 575:116-117 (PMID: 15924887); Hamadeh, et al. 
(2002) “Prediction of compound signature using high density gene expression profiling” Toxicol. Sci. 67:232-240 (PMID: 12011482); Hamadeh, et al. (2002) “Gene expression analysis reveals chemical-specific profiles” Toxicol. Sci. 67:219-231 (PMID: 12011481); Hamadeh, et al. (2002) “An overview of toxicogenomics” Curr. Issues Mol. Biol. 4:45-56 (PMID: 11931569); Hamadeh, et al. (2002) “Detection of diluted gene expression alterations using cDNA microarrays” Biotechniques 32:322, 324, 326-329 (PMID: 11848409); and Hamadeh, et al. (2001) “Discovery in toxicology: mediation by gene expression array technology” J. Biochem. Mol. Toxicol. 15:231-242 (PMID: 11835620). Other polymorphisms than SNPs, e.g., those with lower than 1% frequency in the population, can be similarly valuable markers.
However, the phenotype resulting from a specific gene may be modified by the milieu of its environment, whether physical or biological. The classical Mendelian model of dominant or recessive alleles presumes that phenotypes are determined by single genes, and that the phenotype is not largely multifactorial. In contrast, multifactorial influences will more typically determine a phenotype, and the dominance or recessive feature of an allele or haplotype may be largely affected by the specific other alleles or haplotypes present, including regulatory or other functional determinants of outcome. For example, one allele may be amplified, modified, attenuated, or repressed by such other factors, many of which will be one or more of the other alleles present. See Yan, et al. (2002) “Allelic variation in human gene expression” Science 297:1143.
Often, alleles are considered to be defined by chromosomal location, and thus “different” alleles may be defined by alternative alleles found positionally on a chromosome. The term allele does not require that the sequence region be coding or “expressed” in a transcriptional or translational context. However, it is well recognized that occasionally gene duplications may occur, and the “duplicated allele” would then be categorized as an allele corresponding to the others. In other circumstances, there may be whole or partial chromosomal duplication (or deletion) effects, where allelic or gene dosage might be affected.
Alternatively, alleles may be entities which are highly related sequence-wise. In this case, a “mutation” would be considered an alternative allele, though they have different but closely related sequences. Thus, sequence relatedness may be another characteristic of allelic relatedness. Such alleles may exhibit, e.g., at least about 98%, 95%, 90%, 85%, 80%, etc., identity over appropriate segments. Identity may be determined using any appropriate method, including, but not limited to, the BLAST algorithm, as described in Altschul et al. (1990) J. Mol. Biol. 215:403-410 (using the published default setting, e.g., parameters w=4, t=17 as a non-limiting example).
The segments may often involve full coding region and adjacent regulatory segments, full coding region, segments of conserved sequence, e.g., domains, portions thereof, or a plurality of segments of appropriate length. Examples of polymorphisms which affect expression have been described. The segment, or plurality of segments, will typically be at least about 30, 40, 60, 80, 100 or more nucleotides, or correspond to at least about 15, 20, 25, 30, 40, 70 or more amino acid codons.
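The identity thresholds and minimum segment lengths described above can be illustrated with a minimal sketch. It assumes the two segments have already been aligned to equal length by some other tool; it is not the BLAST algorithm, and the function names, the gap character, and the default values are hypothetical choices made for illustration only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over two pre-aligned segments of equal length.

    Alignment columns in which either segment carries a gap count as
    mismatches (an assumption; other conventions exist).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned segments must have equal length")
    if not aligned_a:
        raise ValueError("segments must be non-empty")
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != gap
    )
    return 100.0 * matches / len(aligned_a)


def is_candidate_allele(aligned_a: str, aligned_b: str,
                        min_identity: float = 90.0,
                        min_length: int = 30) -> bool:
    """True if the segment is long enough and meets the identity threshold."""
    if len(aligned_a) < min_length:
        return False
    return percent_identity(aligned_a, aligned_b) >= min_identity
```

For example, two aligned 30-nucleotide segments differing at three positions give 90% identity and would pass a 90% threshold but not a 95% threshold.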
Functional relatedness may be another feature of alternative alleles. Thus, if one allele corresponds to a specific encoded enzyme (e.g., along the “one gene corresponding to one enzyme” model), another enzyme which can substitute functionally or structurally (e.g., in a multisubunit complex; as a related pharmacological binding target, or as a regulatory component) could be considered an alternative allele, even if it is not encoded at the same genetic locus. Thus, it will be useful to evaluate other allelic entities which might share substrate or reaction specificity and/or expression in similar or alternative organ or physical locations. It will be particularly useful to evaluate, e.g., for presence, quality, and/or quantity of, alternative entities which share similar substrate specificities, enzymatic turnover numbers or rates, and the like. These may include proteomic variants, such as glycosylation, phosphorylation, or regulatory variants.
Thus, the multifactorial component of phenotype actually suggests that the “combination of factors (or haplotypes)” is likely really the determinant of the end phenotype. And diagnosis of phenotype will be much more effectively achieved by determining the presence and absence, quantity, and quality of all relevant factors. Thus, correlation of phenotype to individual genes or haplotypes will be inherently less precise than to combinations of relevant haplotypes (typically referred to herein as “complement of haplotypes”). Thus, the statistical analysis goal will be to correlate phenotype to “all relevant factors”, rather than to single genes, or allelic pairings only. And the number of genes, coding regions, or discontinuous haplotype segments to be evaluated may run from about 5, 7, 9, 11, 14, 17, 21, and more. Discriminatory, classification, or substitute marker patterns diagnostic of phenotype will be identified using this process.
Penetrance of the “pattern of relevant factors” will also have influences. These will be factors which explain why clonal genetic systems (twins; genetically identical individuals) may exhibit variation in phenotype, perhaps for stochastic reasons. These will include, among many other things, the developmental aspects of the biological system (distant history), the recent history of the system (e.g., current environmental factors which affect the physiology or other biology, e.g., diet, stress, behavioral factors which affect the biology, hormonal factors, circadian factors, etc.), disease processes, medication processes, and other factors which affect the biochemistry, physiology, or other biological features of the relevant environment. In particular, the combination of therapeutic entities will be important, as drug-drug interactions often occur in individuals experiencing complex medical conditions.
However, since features which may be important often cannot be predicted, the factors most likely to have been documented are those which medical practitioners consider to be relevant. Thus, medical records are intended to provide the observations which have statistical likelihood, given prejudices of how biological systems operate, of explaining the phenotype. Thus, medical records are invaluable in the attempt to discover non-genetic factors which contribute to particular phenotypes. Alternatively, features which have been correlated with phenotype in other studies are likely candidates. See, e.g., Frank and Hargreaves (2004) “Clinical biomarkers in drug discovery and development” Drug Discovery 22:15-22 and related articles.
Systems biology is highly relevant in understanding the systemic aspects of physiology in an intact animal. In particular, most biochemistry is studied in an arguably two-dimensional system, in mainly homogeneous solution with a time dimension. However, an organism has significant topological features, including intracellular organelles, internal organs, and interactions between organs which can interact via circulation and lymph. Thus, diverse regions of the body may contribute to toxicity, and also to disease. In particular, although toxicity may be manifested in a particular location, the underlying cause may be in one or several other remote organs or locations. Looking at the site of manifestation of symptoms may be looking at the wrong place.
A first application of systems biology will be to identify additional markers which are relevant to already identified markers. Pathway members upstream or downstream of an identified marker are likely candidates to also be relevant to the toxicity pathway, and are potential block points to progression or control. Pathway related entities may be found from many sources, including (1) biomarkers which have been reported to physically interact or co-localize with a candidate; (2) biomarkers which have been mentioned in a publication with a candidate (suggesting functional or structural similarity, whether a likely off-target functional or binding interaction with a therapeutic compound); (3) biomarkers which are similarly regulated in gene expression studies with a candidate in various organs, suggesting a true coordinate regulation; (4) biomarkers which are similarly localized in various organs, also suggesting coordinate regulation; and (5) biomarkers which are closely located physically on a chromosome, e.g., within thousands, tens of thousands, hundreds of thousands, millions, tens of millions, etc. nucleotides, or 0.1, 0.3, 1, 3, 10, 30, 100, 300, etc., centimorgans. Different combinations of these indicators may be used.
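Indicator (5) above, physical proximity on a chromosome, lends itself to a short illustration. The following is a minimal sketch under stated assumptions: markers are represented as hypothetical (chromosome, position) pairs in nucleotides, and the function name and window size are illustrative rather than part of the described method.

```python
def nearby_markers(anchor, candidates, max_distance):
    """Return names of candidate markers on the same chromosome as the
    anchor and within max_distance nucleotides of it, nearest first.

    anchor:     (chromosome, position) of an already identified marker
    candidates: dict mapping marker name -> (chromosome, position)
    """
    chrom, pos = anchor
    hits = [
        (abs(cand_pos - pos), name)
        for name, (cand_chrom, cand_pos) in candidates.items()
        if cand_chrom == chrom and abs(cand_pos - pos) <= max_distance
    ]
    return [name for _, name in sorted(hits)]


# Hypothetical marker names and coordinates, for illustration only.
example_candidates = {
    "M1": ("chr1", 1_050_000),
    "M2": ("chr1", 5_000_000),
    "M3": ("chr2", 1_000_100),
    "M4": ("chr1", 990_000),
}
close = nearby_markers(("chr1", 1_000_000), example_candidates, 100_000)
```

Widening `max_distance` from hundreds of thousands to millions of nucleotides corresponds to the broader windows listed above; a window expressed in centimorgans would require a genetic map, which this sketch does not model.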
Alternatively, other entities known or reported to interact with a relevant marker are potential targets for regulatory intervention. Other “related” aspects of a marker may take the form of structural or functional variants of the marker, which may serve to change the kinetics or specificity of the pathway progression. Screening methods are available to screen likely aspects of identified biomarkers to evaluate DNA copy number, RNA expression levels, protein expression levels, features of the protein which are likely to affect function, including post-translational and similar modification (e.g., phosphorylation, acetylation, methylation, glycosylation, ubiquitination, etc.), or enzyme turnover numbers, half-life, and other similar features.
Existing expression data may be relevant to determine where to look. For example, many organs may be eliminated as relevant sites for analysis if the biomarker is not expressed in those sites.
Temporal dynamics in biochemistry are often poorly explored. Certain temporal dynamics are well recognized, e.g., neurobiology and ion flux changes over millisecond time spans, circadian rhythms of behavior and metabolism, menstrual cycles of hormonal changes over monthly intervals, and seasonal changes of hibernation and migration; the temporal aspects of toxicology, however, have been little investigated. Dynamic aspects of diagnostic assays are often poorly understood, and many vary dramatically over such time periods. Gene expression profiling data will often be subject to noise components much larger than the signal, and the relative expression levels of genes may be lost in such cyclical variation. Thus, studying toxicity pathways in the context of dynamic physiology may uncover a heretofore unrecognized dimension to their understanding.
The dynamics of initiation, progression, and eruption of symptoms are not well understood. Tracking such progression, especially in a single individual, may allow identification of earmark patterns of features related to the biomarkers. Gene expression, protein expression, or metabolic function are features of likely relevance. Once such earmarks are recognized, the features may be used to monitor dynamically the progression of the pathway in individual patients, to detect the eruption of symptoms, and to predict the timing of onset of symptoms. The pathway then becomes more easily manageable, and the timing of necessary actions to prevent or deal with the toxicity can be determined. Switching to a different drug or administering a preventative treatment may be in order.
Particular patterns of dynamics can be evaluated, based on sufficient time points and scales. Thus, if the progression of the toxicity pathway takes hours, the evaluation should establish baseline levels, trace the pathway across a sufficient period of time, and probably at least follow through to full manifestation of symptoms. Sufficient numbers of analyses should be performed over the window, e.g., minutes, tens of minutes, fractions of hour, hours, fractions of days, days, weeks, months, or even years. Alternatively, time periods would be in the ranges of 1, 3, 10, 30, 100, 300, 1000, 3000, 10K minutes. Optimally, the manifestation of symptoms is sufficiently separated in time from the earmarks that a monitoring system can reliably identify the earmarks indicating onset of irreversible progression.
Typical dynamic patterns will encompass, e.g., constantly steady, steady change (increasing or decreasing), increasing and then decreasing, decreasing and then increasing, stable then changing, changing then stable, with time points for inflection being particularly notable. Differences between patterns characteristic of clinical phenotypes are of greatest interest. Often earmarks of events include a combination of patterns of different biomarkers.
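The pattern categories listed above can be sketched as a simple classifier over a series of biomarker measurements. This is a minimal illustration, not a validated method: it assumes evenly meaningful time points, treats any change smaller than a hypothetical tolerance `tol` as "no change", and reports the indices at which the direction of change switches. All names are illustrative.

```python
def sign_runs(values, tol=0.05):
    """Collapse successive differences into runs of +1 (rising),
    -1 (falling), or 0 (flat within tol).

    Returns a list of (sign, start_index), where start_index is the index
    of the time point at which that run begins.
    """
    runs = []
    for i, (prev, cur) in enumerate(zip(values, values[1:])):
        delta = cur - prev
        sign = 0 if abs(delta) <= tol else (1 if delta > 0 else -1)
        if not runs or runs[-1][0] != sign:
            runs.append((sign, i))
    return runs


PATTERN_NAMES = {
    (0,): "steady",
    (1,): "steady increase",
    (-1,): "steady decrease",
    (1, -1): "increasing then decreasing",
    (-1, 1): "decreasing then increasing",
    (0, 1): "stable then changing",
    (0, -1): "stable then changing",
    (1, 0): "changing then stable",
    (-1, 0): "changing then stable",
}


def classify_pattern(values, tol=0.05):
    """Return (pattern label, indices of time points where the trend changes)."""
    if len(values) < 2:
        raise ValueError("need at least two time points")
    runs = sign_runs(values, tol)
    label = PATTERN_NAMES.get(tuple(sign for sign, _ in runs), "other")
    change_points = [start for _, start in runs[1:]]
    return label, change_points
```

A combination of such labels across several biomarkers, together with the change-point indices, is one way an earmark pattern might be represented; series with more than one change of direction fall into the "other" category in this sketch.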
As discussed, the penetrance of a defined genetic state is affected by genetic or non-genetic factors. The statistical analyses which can identify meaningful genetic features will be most successful where interfering noise is minimized, i.e., where identification of false positive factors or false negative factors will be minimized. The elimination of population heterogeneity will maximize the opportunity to recognize the signal over noise. Thus, analyses will be greatly improved and will be mathematically most efficient when performed on a homogeneous population of sufficient size. In contrast, the population should exhibit sufficient heterogeneity that the spectrum of phenotypes contained therein is reflective of a “global” population.
Studies in humans would be preferred, which eliminates issues of whether a non-human mechanism is relevant to a human. In human genetics studies, some compromise should be selected between a homogeneous population whose phenotypic variation totally fails to reflect mechanisms which may be present in other populations, and the other extreme of a highly diverse population where the non-genetic factors are such that the penetrance of the genetics might fail to be discernible. Statistical methods in genetic analyses are described, e.g., in the references on genetics described above, and in Ewens (2004) Mathematical Population Genetics (2d ed.) Springer, ISBN: 0387201912; Fernholz, et al. (eds. 2001) Statistics in Genetics and Environmental Sciences Birkhauser, ISBN: 3764365757; Svirezhev and Passekov (1990) Fundamentals of Mathematical Evolutionary Genetics (Mathematics and its Applications) Springer, ISBN: 9027727724; and Halloran and Geisser (eds. 1999) Statistics in Genetics (IMA Volumes in Mathematics and Its Applications) Springer-Verlag Telos, ISBN: 0387988289. Studies using genetically homogeneous populations are described, e.g., in Shifman and Darvasi (2001) “The value of isolated populations” Nat. Genet. 28:309-310; Kruglyak (1999) “Prospects for whole-genome linkage disequilibrium mapping of common disease genes” Nat. Genet. 22:139-144; and similar publications.
Homogeneous and/or large population sources for genetic studies are useful, preferably with medical details allowing subsetting, e.g., East Finland Population (Jurilab); Icelandic population (deCODE Genetics); Ashkenazi Jew population (see, e.g., familystudy@jmhi.edu; or Johns Hopkins School of Medicine); Sardinia (Shardna Life Sciences); Quebec (Genizon Biosciences); Mormon population (Utah Population Database (UPDB) and with Church of Latter Day Saints (LDS) records); Estonian Genome Project (EGen); Iranian Human Mutation Genebank and database (see Najmabadi, et al. (2003) Hum. Mutat. 21:146-50); UK Avon Longitudinal Study of Pregnancy and Childhood (ALSPAC); Costa Rica (see Service, et al. (2001) Hum. Mol. Genet. 10:545-551); Newfoundland (Newfound Genetics); and others. Features of particular interest to genetic homogeneity include, e.g., geographic isolation, homogeneity, founder effect, genetic drift and extended linkage disequilibrium (LD). See also Rahman, et al. (2003) Human Molecular Genetics 12:Review Issue 2 R167-R172; and Heutink and Oostra (2002) Hum. Mol. Genet. 11:2507-2515.
The homogeneity of a population is often difficult to define quantitatively, and the sampling is subject to difficulty in definition by inclusion of outliers, whose data may be disregarded or eliminated later based on results of more careful analysis indicating genetic outsider status. But the invention allows, by testing, means to confirm by SNP or other methods criteria for inclusion into the desired population. Although there are different ways to mathematically define linkage disequilibrium, it relates to how much of the genome is typically coordinately inherited. The means to define such are based upon the selection of markers to evaluate such, generally polymorphisms, generally referred to as single nucleotide polymorphisms (SNP), in nucleic acid sequence of the genome. With a selected set of SNPs, there are measures of the granularity of analysis, e.g., how homogeneous the dispersion of the markers is. Both can be evaluated by a quantitative measure of “median” or “mean” intermarker distance. The ranges are typically in the thousands of KB separation, while high throughput microchip technology can provide generally from about 90K to 500K SNPs on a single sample analysis. With such resolution, and the size of the human genome, one gets about 4 KB mean separation. But based on the specific set of SNPs used, the median range may vary depending upon what regions may have higher or lower density SNPs selected.
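The “mean” and “median” intermarker distances mentioned above can be computed directly from marker positions. The sketch below assumes all positions lie on a single chromosome and are given in nucleotides; the function names are illustrative.

```python
from statistics import mean, median


def intermarker_distances(positions):
    """Distances between adjacent markers after sorting by position."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def spacing_summary(positions):
    """Mean and median intermarker distance for one chromosome."""
    gaps = intermarker_distances(positions)
    if not gaps:
        raise ValueError("need at least two marker positions")
    return {"mean": mean(gaps), "median": median(gaps)}


# Three markers clustered together and one distant marker: the mean is
# pulled upward by the single large gap, while the median is not.
summary = spacing_summary([0, 1_000, 2_000, 12_000])
```

A large difference between the mean and the median, as in this example, indicates uneven marker density across the region.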
The values for linkage disequilibrium range from 0 (no unusual linkage) to 1 (highest linkage), and may run in a range from 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, or 0.90 depending upon the granularity of regions being evaluated. Higher local LD values over greater distances are more significant than higher LD values over shorter distances. See, e.g., Shifman, et al. (2003) Hum. Molec. Genetics 12:771-776. The regions of greatest interest in these analyses will be specifically local physically to biomarkers discovered as described, e.g., in Tables 3 and 4 directed to liver toxicity markers.
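The text does not specify which mathematical definition of linkage disequilibrium is intended. As an illustration only, the sketch below computes two commonly used pairwise measures that both fall between 0 and 1, namely |D'| and r², from counts of the four two-locus haplotypes. The function name and the haplotype counts are assumptions made for the example.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, |D'|, r^2) from counts of the four two-locus haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / total   # frequency of allele B at the second locus
    d = p_AB - p_A * p_B          # raw disequilibrium coefficient
    variance_term = p_A * (1 - p_A) * p_B * (1 - p_B)
    if variance_term == 0:
        raise ValueError("both loci must be polymorphic")
    # Normalize D by the largest magnitude it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return d, abs(d) / d_max, d * d / variance_term

# Hypothetical haplotype counts for one SNP pair; illustrative values only.
d, d_prime, r_squared = linkage_disequilibrium(40, 10, 10, 40)
```

Applying such a function to SNP pairs at increasing physical separation would indicate how far elevated LD extends around a biomarker of interest.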
As such, analyses of populations may also take the form of selecting amongst the data derived from subsets within the populations, with the “most homogeneous populations” being selected subsets of populations considered genetically homogeneous, e.g., with selection excluding individuals identified as being not within the characteristics defining the homogeneity.
A preferred population will be one that derives from a founder population having a low number of founders, is traceable back through several generations (preferably at least about 5, 8, 12, 15, 20, or more), has comprehensive medical and historical information (preferably some medical records for most, other information on genetic relationships, e.g., church marriage and parentage records), exhibits a high rate of “inbred” population expansion, and is large enough to allow for sizable study cohorts. The number of founders preferably is less than about 20K, 15K, 3K, or even about 1500 or 900, often determinable by evaluating mitochondrial DNA or Y chromosome homogeneity. The number of substantially traceable generations preferably will be more than about 5, 10, 15, 20, 50 or more generations. Useful medical records preferably will exist and/or be available for at least 5, 10, 15, 20, 25 or more years, and the familial relationships substantially traceable for 3, 5, 7, 10, 13, 17, or 20 generations. The details of the medical records will range from limited, occasional events, e.g., only hospital admissions, to more frequent clinic visits; and details may range from complete medical records with associated diagnostic test results to limited annotations relating to sex, age, outcome, or the like. Often the particular sample itself provides annotations, e.g., sex organs may inherently subset, or sex can be readily determined by a simple diagnostic procedure on the sample (e.g., presence of Y chromosome). For statistical purposes, the study population derived from the founders will preferably be at least about 70K, 140K, 220K, 300K, 500K, 800K, 1.1 M, 1.5 M, or more, and the phenotype numbers will preferably be large, e.g., at least about 5, 7, 10, 13, 16, 20, 25, 50, 100, 150, 200, or more examples. Adverse event reporting schemes may identify reports of at least about 5, 10, 20, 30, 50, 70, 100, 150, 200, 450, or more putative events.
In particular, genetically homogeneous populations allow for generation of hypotheses of correlations exhibiting low false positive rates, e.g., providing advantageous statistical power. Validating those hypotheses among larger and more heterogeneous sample populations is much simpler and less costly than doing the initial studies in the larger and more heterogeneous population. Statistical power in the latter population is poorer and will lead to many more false positives and spurious signals among the higher background noise statistics.
Alternative means to access samples for evaluation include clinical trial biobanks of sufficient size that phenotypically homogeneous cohorts can be selected. Those biobanks may be derived from clinical trial samples or, outside of a clinical trial context, from a large enough collection of relatively non-homogeneous samples with sufficient annotation to select relevant cohort subsets. Biobanks, also known as human tissue banks or biorepositories, include various governmental efforts including the UK, Estonia, Canada, Norway, Sweden, US (including NIH), Iran, Singapore, Japan, Spain, Germany; private biobanks, e.g., deCODE, Oxagen, Galileo; tissue biobanks, e.g., Ardais, Genomics Collaborative, Astand, ILSbio; and disease focus banking efforts, e.g., heart disease, osteoporosis, bone marrow registries, breast cancer registries, foundations (cystic fibrosis, Parkinson's disease, etc.). See, e.g., Zimmerman, et al. (2004) Biobanks: Accelerating Molecular Medicine, IDC #4296; International Society of Biological and Environmental Repositories (www.isber.org); and International Biobank and Cohort Studies Meeting, Feb. 7-8, 2005, Atlanta, Ga. Alternative sample sources may include national, e.g., UK, Canadian, or China health care systems, Health Maintenance Organization samples (e.g., medical records associated with patients with permission to sample), health insurers, or medical centers.
The homogeneous human cohorts may be used either in a training subset, to generate a hypothesis, and/or to validate a hypothesis. The former is often much more difficult, as it makes fewest assumptions about the biology leading to toxicity pathway activation, while the latter can confirm a hypothesis which may have been generated from any source, e.g., animal model data. The number of cohort samples for the training set and/or validation set (separately or in combination) from the genetically homogeneous population is preferably at least 5, 10, 15, 30, 45, 65, 80, 100, 120, 145, 170, 200, 235, 270, 300, 330, 370, 410, 440, 470, 500, 600, or more.
The initial identified gene or haplotype dataset will identify genes believed to be genetically correlated with phenotype. With tools which allow, e.g., cross species comparisons of structure and presumptive function, the species from which the gene dataset is derived may be different from the species in which the biomarkers are desired. From the understanding of function or pathway networks (pathways and networks will typically be used interchangeably) in one species, structural correlation across species, and functional studies may be used to cross species boundaries. Thus, a gene identified in a rat species dataset would often be expected to have human counterparts, either structurally or functionally. This sets up a hypothesis which can be tested in a human based system, directly or indirectly. Typical pathways likely to be relevant to toxicology will include detoxification pathways (e.g., cytochrome P450s), transporters (influx/efflux), drug metabolizing pathways, and the like.
From a set of identified genes which are believed to be correlated with phenotype, the assortment of genes will point to pathways/networks of interacting genetic entities which relate to the phenotype. Many of the genes will be readily assigned to pathways. Most others can be assigned by use of ontology analyses, which will allow the identification of less clearly understood pathways, networks, or mechanisms. The few remaining genes which do not readily get assigned to a network or mechanism can be analyzed using systems biology and cross species structural and functional means to assign probable pathway linkages. This provides the means to apply datasets from one species, e.g., mouse, rat, dog, or primate, to another, e.g., human. However, the identification of a marker (or series thereof) in one species would need to be applied to counterpart markers in another species in the context of the pathways and physiology of that second species.
These pathways and networks are then evaluated, with respect to their members, for the functions which are necessary for the pathway. These will include the development of the structural components, creation and regulation of the various components, and maintenance of the other functional features of the pathway or networks, each of which relates to the phenotype.
With the identification of the components of the pathways or networks, hypotheses can be formed as to which components are critical points which would regulate or control the development, or prevent the appearance, of the phenotype. These hypotheses will then be testable to identify combinations of genes or biomarkers which contribute to the phenotype. Perturbation analyses and surrogate markers can be applied to determine which biomarkers possess maximum relevance to the phenotype.
Within the identified sets of genes whose expression is correlated to phenotype, many of the genes will be readily assignable to understood metabolic pathways or networks, and it will often be readily understood how that pathway can mediate the resulting phenotype. This identification can have a dramatic impact on recognizing that different pathways or networks have relevance in the phenotype which had heretofore remained unrecognized. Systems biology interactions between networks and biological systems will become better understood as the relevance of a pathway will be seen to impact seemingly remote phenotypes, e.g., how fundamental metabolic pathways or networks have impacts on multiple organ systems, or how pathways or networks recognized to affect one system actually also impact phenotype in another system or remote body location. Conversely, many different pathways may affect the same phenotype in similar or related ways.
However, among the identified marker set, some fraction will not be readily assigned to or within a known pathway or network. In these situations, a predictive model will typically be developed in two steps. A first step might be characterized as identifying or developing an initial limited set of “preliminary” signatures for specific data types. A separable second step might be characterized as generating a predictive model that combines different types of signatures, and may incorporate more discriminating and higher resolution features which correlate more closely with the desired prediction.
As an example of a step I, consider gene expression measurements for several thousand genes (or, similarly, a proteomic profile, a metabolomic profile, or a mixture or combination of various forms of profiles) for each sample. An initial step might be data reduction, e.g., intending to focus on and identify a profile of a few genes or features that account for a majority of the variation in the data. See, e.g., Jolliffe (2002) Principal Component Analysis (2d ed.) Springer, ISBN: 0387954422; Krzanowski (2000) Principles of Multivariate Analysis: A User's Perspective (Oxford Statistical Science Series) Oxford Univ. Pr. ISBN: 0198507089; Gnanadesikan (1997) Methods for Statistical Data Analysis of Multivariate Observations (2d ed.) Wiley-Interscience, ISBN: 0471161195; Ramsay and Silverman (1997) Functional Data Analysis (Springer Series in Statistics) Springer, ISBN: 0387949569; Muirhead (1982) Aspects of Multivariate Statistical Theory (Wiley Series in Probability and Statistics) Wiley-Interscience, ISBN: 0471094420; Mardia et al. (1980) Multivariate Analysis (Probability and Mathematical Statistics) Acad. Pr., ISBN: 0124712525; and other texts on multivariate analysis.
One approach is to use Principal Component Analysis (PCA) to reduce the dimension of the data. The first few components (e.g., the top 5) may account for a majority (say 90%) of the variation in the data. As one adds more components, the modeling improves, but the marginal improvement in the model will decrease with each additional component included (diminishing returns).
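The component selection step can be sketched as follows, assuming that the variance captured by each principal component (its eigenvalue) has already been obtained from a PCA routine. The function name, the 90% target, and the eigenvalues are hypothetical values chosen for illustration.

```python
def components_needed(eigenvalues, target_fraction=0.90):
    """Return (count, cumulative fractions) for the fewest components reaching the target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    cumulative = []
    for value in ordered:
        running += value
        cumulative.append(running / total)
    for count, fraction in enumerate(cumulative, start=1):
        if fraction >= target_fraction:
            return count, cumulative
    return len(ordered), cumulative

# Hypothetical per-component variances (arbitrary units); not measured data.
component_variances = [50, 20, 10, 5, 5, 5, 3, 2]
n_components, cumulative = components_needed(component_variances)
# Marginal gain contributed by each successive component.
marginal_gains = [v / sum(component_variances) for v in component_variances]
```

Because the components are ordered by decreasing variance, each added component contributes no more than the one before it, which is the decreasing return described above.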
Another approach is to group the data using a clustering technique. Many such techniques exist, with the objective to bin the data into clusters so as to maximize distance between clusters (e.g., to distinguish each cluster from its neighbors). See, e.g., Anderberg (1973) Cluster Analysis for Applications Academic Press; Hoppner, et al. (1999) Fuzzy Cluster Analysis Wiley, ISBN: 0471988642; Zhao (2004) “Evolutionary Computing and Splitting Algorithms for Supervised Clustering” Masters Thesis, U. of Houston (http://www.es.uhu.edu/~zhenzhao/zhenghongthesis.zip); and Dettling and Buhlmann (2002) “Supervised clustering of genes” Genome Biology (2002) 3:1-15. Clustering techniques vary widely, and some are more successful than others at separating the various clusters.
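No particular clustering technique is prescribed here. As one illustration, the sketch below uses k-means, which bins samples by repeatedly assigning each to its nearest cluster center and then recomputing the centers. The two-feature sample profiles are hypothetical, and the deterministic choice of the first k samples as starting centers is a simplification made for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (labels, centroids)."""
    # Simplifying assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's center
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical two-feature profiles for six samples; illustrative values only.
profiles = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.0), (5.1, 4.9), (1.2, 0.8), (4.8, 5.0)]
labels, centroids = kmeans(profiles, k=2)
```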
Yet another approach is to use supervised clustering techniques. The objective here is to classify data using information other than what is contained in the data itself. For example, knowledge about a group of genes regulated by the same transcription factor can be used to group them together; another is to use medical record data to group data.
The end result of a step I might be an identified handful of genes or patterns accounting for most of the variations contained in the larger dataset. The patterns may be combinations of different markers, different forms of analyses (genotypic, RNA expression, protein expression, post-translational features, and/or metabolic features), different sites or organs, temporal features, and such. See, e.g., Abraham (2004) Trends in Biotechnology 22:15-22; and Frank and Hargreaves (2003) Drug Discovery 2:566-580.
In a step II, a predictive model may be built that combines several of these patterns. This might be characterized as a curve fitting exercise. The end point is the desired clinical outcome correlated back to a pattern or signature.
In one approach to this development of a predictive model, one can work backwards. For example, one may say that if a particular patient (data point) has genotype pattern X, expression profile Y (Y could be a combination of several distinct profiles, e.g., proteomic, or metabolomic), and clinical phenotype Z, then there is an n % chance (correlation coefficient) of toxic event. One simplistic modeling strategy would construct a model with X, Y, and Z as parameters with proper weight factors (correlation coefficients) to fit the population of data points. The weights will be chosen (assigned) to account for the known clinical outcome across the data points.
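The weighting strategy described above can be sketched with a logistic model, which is an assumption made for this example rather than a method named in the text: each of X, Y, and Z receives a weight, and the weights are adjusted by gradient descent until the predicted chance of a toxic event matches the known outcomes. All patient values, the learning rate, and the step count are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(weights, features):
    """Predicted probability of a toxic event; weights[0] is the intercept."""
    bias, *coefs = weights
    return sigmoid(bias + sum(w * x for w, x in zip(coefs, features)))

def fit_weights(rows, outcomes, learning_rate=0.5, steps=2000):
    """Fit the weights by gradient descent on the mean logistic loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(steps):
        # Prediction error for every data point under the current weights.
        errors = [predict(weights, r) - y for r, y in zip(rows, outcomes)]
        weights[0] -= learning_rate * sum(errors) / n
        for j in range(len(rows[0])):
            grad = sum(e * r[j] for e, r in zip(errors, rows)) / n
            weights[j + 1] -= learning_rate * grad
    return weights

# Hypothetical data points: (genotype pattern X present, expression score Y,
# clinical phenotype Z present), each with a known outcome (1 = toxic event).
patients = [(1, 0.9, 1), (1, 0.8, 0), (0, 0.2, 0), (0, 0.1, 1), (1, 0.7, 1), (0, 0.3, 0)]
toxic_event = [1, 1, 0, 0, 1, 0]

weights = fit_weights(patients, toxic_event)
risk_for_new_patient = predict(weights, (1, 0.85, 1))
```

The fitted weights play the role of the weight factors in the text; a larger weight indicates that the corresponding pattern moves the predicted chance of a toxic event more strongly.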
Alternatively, another approach would be the use of a neural net. See, e.g., Pearson, et al. (eds. 2004) Artificial Neural Nets and Genetic Algorithms: Proceedings of the 6th International Conference in Roanne, France, 2003 Springer, ISBN: 3211007431; Kurkova (ed.), et al. (2001) Artificial Neural Nets and Genetic Algorithms Springer, ISBN: 3211836519; Bishop (1995) Neural Networks for Pattern Recognition Oxford Univ. Pr., ISBN: 0198538642; and Nelson and Illingworth (1994) Practical Guide to Neural Nets Addison Wesley Pub., ISBN: 0201633787. In some such embodiments, the network is trained with part of the data; that training leads to a hypothesis; and the hypothesis is cross-validated against the rest of the data.
Yet another modeling technique to generate a predictive model is the use of decision trees. Here data is split iteratively in a tree like form with subsequent branches explaining more and more of the data and ultimately reaching a class. See, e.g., Mitchell (1997) Machine Learning McGraw-Hill, ISBN: 0070428077; and Duda, et al. (2000) Pattern Classification (2d ed.) Wiley-Interscience, ISBN: 0471056693.
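The iterative splitting can be sketched as a small recursive tree builder. The use of Gini impurity as the splitting criterion, the depth limit, and the two features (an enzyme level and a dose) are assumptions made for this illustration, and the samples are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 when all labels agree)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature index, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Split recursively; a leaf is a class label, a branch is a 4-tuple."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth))

def classify(tree, row):
    """Follow branches until a leaf (class label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical samples: (enzyme level, dose) with an observed class.
samples = [(0.2, 50), (0.1, 10), (0.8, 50), (0.9, 10), (0.25, 40), (0.85, 45)]
classes = ["toxic", "safe", "safe", "safe", "toxic", "safe"]
tree = build_tree(samples, classes)
```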
Having applied the various alternatives here, the end result of a step II would be a model (or hypothesis) that weighs the inputs to construct a predictive model relating to a form of related network or pathway. This model would be constructed using part of the data (“training set”); while other parts of the data are used to validate the model.
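The division of the data into a training set and a validation set can be sketched as a simple holdout split. The 30% validation fraction, the fixed random seed, the single-feature cohort, and the threshold model are all hypothetical choices made for the illustration.

```python
import random

def split_train_validation(samples, validation_fraction=0.3, seed=0):
    """Shuffle reproducibly, then hold out a fraction of samples for validation."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def accuracy(model, labelled_samples):
    """Fraction of samples whose predicted outcome matches the recorded outcome."""
    hits = sum(1 for features, outcome in labelled_samples if model(features) == outcome)
    return hits / len(labelled_samples)

# Hypothetical cohort: one marker value per sample, outcome True above 0.45.
cohort = [((i / 10,), i >= 5) for i in range(10)]
training_set, validation_set = split_train_validation(cohort)

# Fit a one-parameter model on the training set only: a threshold midway
# between the highest negative and the lowest positive training value.
highest_negative = max(f[0] for f, outcome in training_set if not outcome)
lowest_positive = min(f[0] for f, outcome in training_set if outcome)
threshold = (highest_negative + lowest_positive) / 2

def model(features):
    return features[0] >= threshold

validation_accuracy = accuracy(model, validation_set)
```

Because the validation samples play no part in choosing the threshold, the validation accuracy gives a less optimistic estimate of how the model may perform on new samples than the training accuracy does.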
Yet another means to reach understanding of a pathway will be based upon genomic information, evolutionary phylogeny, and related systems biology. It will often occur that a gene of unknown function has structural, sequence, or regulatory similarity to another in a different species. This may provide insight and testable hypotheses relating to the function or pathway in the original species, or other species may provide greater details into what other features or structures may exist in a pathway or network. Thus, knowledge databases linking data within a species or across species will be useful. There are many such databases, e.g., products similar to those provided by Ingenuity, Entelos, BioVista, Jubilant BioSystems, and the like. Similar offerings should be available from the NCBI, various European or other counterparts, or other websites. Various similar pathway collections should be directed to pathways relevant to specific disease states, physiological conditions, or biological subsystems (e.g., cardiovascular, digestive, respiratory, hormonal, etc.).
Identified genes may be correlated with phenotype (e.g., treatment outcome) to identify pathways, but individual genes or features within that pathway will each exhibit differing correlation coefficients from other features. The outcome correlation with such features may vary depending upon frequency of the feature in the selected study population, the frequency of various mechanisms of phenotype outcomes, the number of different mechanisms leading to a similar phenotype, and many other factors. However, preferred biomarkers are those which represent the most common mechanisms or pathways leading to the relevant phenotype, and within each of those networks, those biomarkers which exhibit optimal (e.g., high) correlation coefficients among various alternative markers (gene or otherwise) in the networks. Thus, optimal biomarkers or signatures may involve a plurality of different diagnostic measurements, and may include dynamic features.
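Selecting, within a network, the markers with the highest correlation coefficients can be sketched as a ranking by the absolute Pearson correlation between each marker and the outcome. The choice of the Pearson coefficient, the marker names, and all values are assumptions made for this illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_markers(marker_values, outcomes):
    """Return (name, r) pairs ordered from strongest to weakest correlation."""
    scored = [(name, pearson(values, outcomes))
              for name, values in marker_values.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical measurements for six samples; 1 = phenotype observed.
outcomes = [0, 0, 1, 1, 1, 0]
marker_values = {
    "marker_a": [0.1, 0.2, 0.9, 0.8, 0.95, 0.15],
    "marker_b": [0.5, 0.4, 0.6, 0.5, 0.4, 0.6],
}
ranking = rank_markers(marker_values, outcomes)
```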
Often, understanding of the pathway or network allows identification of features which are diagnostically most relevant to the phenotype, and may allow for identifying parameters and features which are directly relevant to the timing, severity, and progression of the phenotype. Thus, a form of reverse engineering will provide for generating a diagnostic strategy to fit the phenotype, and selecting the appropriate diagnostic parameters within that context.
In other circumstances, less about the pathway is known, and some experimental component may be useful to determine what features or factors are more directly relevant to the phenotype. In other situations, the pathway may be incompletely understood regarding relevant biomarkers, or the interactions or regulatory processes in its physiological function. These can be filled in by some combination of metabolic pathway analysis and systems biology analysis. The basic means to elucidate metabolic pathways follow the principles of elucidation performed in intermediary metabolism decades ago. Systems biology strategies have been referred to above. The pathway elucidation generally need not be performed in any specific species, as most pathways are shared across evolution, and the relevance of related species will typically be great. Thus, the proper species will often be selected on the basis of peculiar biological features in a species, access to tools, materials, and ease in confirming the relevance of that species to others. Fundamental pathways, e.g., those that mediate cell division, signal transduction, or processing of xenobiotics, are often conserved across the animal kingdom and in many cases between fungal and animal kingdoms as well.
The correlation between pathway or network specific features and phenotype will typically be most robust when the measurement directly evaluates a critical signature which is proximal to the phenotype (outcome result), and not a distant one. Being proximal, there will typically be fewer, if any, compensatory factors which can attenuate the end phenotype, and thus the statistical correlation coefficient will be highest (optimized). In addition, certain threshold level features, sensitivity and consistency in measurement within and across individuals and time, sampling/measurement errors, ease and speed of determination, local or centralized analyses, and other features will be technically preferred, providing cleaner measurement signal relative to noise. Other factors for selection of specific features may include economic efficiency of evaluation. As indicated above, combinatorial features will often be much superior to individual features. Temporal features may be identified, which may provide useful insight into the progression and dynamics of phenotype emergence.
Thus, analysis of the genes which correlate with a phenotype can lead to identification of mechanistic pathways which seem to be common, and thus lead to hypotheses as to mechanisms of the phenotype. Testing these hypotheses can be done much more readily with more favorable statistical precision and experimental design than creating a hypothesis from raw data, and various tests to eliminate particular models might be possible from historic or simple experimental procedures. Thus, initial identification of correlated genes to a phenotype (e.g., treatment outcome) will often be a first, but incomplete step to defining and understanding a mechanism to explain such treatment outcome (which may be an adverse drug event).
Once gene sets have been identified as correlated with a particular phenotype, analyses of those sets will point to particular pathways which mechanistically should relate to the phenotype. Thus, when the set contains many genes within a particular functional pathway, the data strongly points to that pathway as an important component causing the phenotype. Typically, a phenotype may be caused by multiple alternative mechanisms, perhaps related or unrelated, but the collection of those mechanisms should teach the various alternative means or pathways which can lead to the phenotype.
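Whether a gene set contains more members of a pathway than chance would explain can be estimated with a hypergeometric over-representation test, sketched below. The test is a common choice introduced here for illustration rather than a method named in the text, and the genome size, pathway size, and overlap are hypothetical numbers.

```python
from math import comb

def enrichment_p_value(genome_size, pathway_size, hit_list_size, overlap):
    """Probability of seeing at least `overlap` pathway genes in a random hit list.

    Treats the hit list as a draw without replacement from the genome
    (hypergeometric model).
    """
    upper = min(pathway_size, hit_list_size)
    favorable = sum(
        comb(pathway_size, k) * comb(genome_size - pathway_size, hit_list_size - k)
        for k in range(overlap, upper + 1))
    return favorable / comb(genome_size, hit_list_size)

# Hypothetical numbers: 20,000 genes, a 150-gene pathway, and a 60-gene
# phenotype-correlated set of which 12 fall in the pathway.
p_value = enrichment_p_value(20_000, 150, 60, 12)
```

A small probability suggests that the overlap is unlikely to be accidental, which is the sense in which the data point strongly to that pathway.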
Systems biology analysis techniques are then applied to the collection of genes identified by correlation analyses to lead to identification of those pathways. And as much as is known or can be presumed about the pathway itself can be collected. Analysis of those pathways, and understanding the networks of interactions and regulation, will reveal genes and reactions which are critical in the pathway leading to the phenotype. From the defined networks of genes and interactions will be identified features and signatures which can be hypothesized to contribute to the dynamic or temporal development of the conditions resulting in the phenotype. These hypotheses can then be tested, thereby providing signatures (single or combinatorial) with optimized correlation with outcome (particularly including significant genetic contributions), a desired temporal prediction (long before, intermediate, or immediately preceding) relative to phenotype, minimal noise, maximal diagnostic stability, high discrimination, and other desired features.
One means of testing can be by specifically applying diagnostic procedures to monitor a relevant system. Thus, an experimental system designed or recognized to involve the designated pathway may be evaluated to determine which are the bottleneck points or critical points for system stability. In vivo experimental models may be used, and often there exist in vivo models which represent a disease. In certain cases, surrogate markers may be used instead of phenotype readout, e.g., in humans, where ethical considerations may prevent direct observations of a phenotype or progression thereof.
Alternatively, in vitro models which may have surrogate markers may be used. However, the systems biology component will also be useful in pointing to what sample types (e.g., which organ or histological type) may be relevant to the phenotype exhibited by a different organ or system of the animal. The insights provided there may often lead to looking for the markers at a different location from where the phenotype first manifests observed symptoms.
Yet another method will involve experimental perturbation analyses to identify those biomarkers (and allelic variants) which can provide diagnostic measures exhibiting high correlation with phenotype.
In addition, the pathways may be evaluated to determine the main pathways which might cause effects which are manifested in the main organ systems of interest in clinical pathology or treatment. Among those systems, e.g., in the toxicology field, are the digestive, circulatory, respiratory, nervous, endocrine, homeostatic, skin, musculoskeletal, blood, urinary, and reproductive (male or female) systems. The organs comprising such systems can also be defined, e.g., in the area of toxicology the main organs of focus are liver, muscle, GI tract, bone marrow, CNS, respiratory, circulatory, and reproductive systems. Using systems biology analyses, it will become apparent that particular mechanistic pathways may cause phenotypes in different functional systems, and the site or type of manifestation of initial symptoms may depend on certain features which will be determined, as before, by genetic, environmental, physiological, or other factors, depending upon the individual. However, those factors will likely be identified only if they are features which could emerge from the medical record of a particular patient.
From the correlation of genes with phenotype, analyses of the genes will be performed to determine the functional roles of such genes. Typically, the genes can be characterized using software, e.g., Gene Ontology (CNIO bioinformatics unit), as being involved in different functional networks, and categorized among biological process, molecular function, or cellular component dimensions. For example, the genes may be involved in metabolism (e.g., enzymes, or regulation of enzymes and metabolic pathways), cellular physiological processes, cell communication, response to stimulus, regulation of physiological processes, organismal physiological process, morphogenesis, regulation of cellular process, death, cell differentiation, homeostasis, growth, protein synthesis, etc.
Studies in non-human species will provide enormous insight into the pathways which can be applicable to human phenotype. For example, studies on toxicity in rodents can, with these analyses, provide information as to what are important toxicity pathways in those rodents. With the genomic information available today, identification of relevant pathways in the rodent can be tested for equivalent relevance in a different species, e.g., human. The hypothesis that the same pathways are relevant in the second species can be then validated. And the corresponding species counterparts are reasonably easily determinable, especially with so much sequence data available, to allow validation of the hypothesis.
Identified genes which correlate with phenotype should cluster in relevant pathways, often networks of functional or structural features which relate to one another. Means to determine the function of genes can be derived from literature reports, genetic mapping studies, and others. Appropriate databases which link such include the Ingenuity, Entelos, Biovista, and Jubilant Biosystems knowledge management system databases, as described above. Descriptions of database offerings are available from simple internet searches.
With the identification of relevant pathways, identification of the specific “critical” factors or signatures, alone or in combinations, can be effected by perturbation analyses. Perturbation analysis will allow determination of whether a particular component is close to or remote from the critical phenotype determining parameter.
Large scale experimental perturbation analyses can be performed to identify those biomarkers which can provide diagnostic measures which provide high correlation with phenotype. Perturbations to proposed markers (and genetic variants) can be effected in an experimental system. Some modifications can be introduced in gene sequence or quantity (regulation); protein sequence, modification, or quantity; substrates or analogs thereof (including inhibitors or regulatory subunits); metabolic intermediates; time of endpoint or analyses; temperature; and isotopic variants. Other changes can be achieved by any of gene expression modifiers (including knockout or transformants), gene suppression (e.g., using RNAi or anti-sense), use of dominant negative forms or suppressors, and activating mutants. Some perturbations may be chemical perturbations, e.g., by varying concentrations of small molecule inhibitors, co-factors (natural or otherwise), or activators, or perturbations in measurements as a function of time. See, e.g., KineMed (www.kinemed.com), in which kinetic features are studied in fundamental problems in disease management and drug development.
Although it is widely recognized that biological systems are characterized by flux (polymers are synthesized and degraded; metabolites traverse pathways to provide energy; cells die and are replaced), virtually no tests in biomedicine currently measure dynamic fluxes of molecules. Contemporary drug development/diagnostic strategies ignore this approach, analogous to the state of photography before the development of moving pictures, or of moving pictures before the introduction of sound.
This view rests, in part, upon the hypothesis that the operational unit of function in complex biological systems is neither the gene nor the protein, but rather is the flow of molecules through metabolic pathways in fully assembled living systems. In the final analysis, it is the flow of molecules through synthetic, catabolic, and intermediary metabolic pathways that is responsible for phenotype.
For example, assays may combine highly sensitive mass spectrometry and the labeling of critical molecular pathways with stable, non-radioactive isotopes in living organisms, including humans. The development of this technology allows for the measurement of molecular fluxes in metabolic pathways critical to human health and disease. The stable isotopes can be delivered by many routes of administration, are safe for use in humans, and the isotopic enrichments of a number of metabolic pathways may be determined in a high-throughput manner. These kinetic assays have been broadly applied to a vast array of human disease states.
Perturbation analysis is described, e.g., in Jansen (2003) “Studying Complex Biological Systems Using Multifactorial Perturbation” Nature Reviews Genetics 4:145-151; Bower and Bolouri (2001) Computational Modeling of Genetic and Biochemical Networks MIT Press; Kanji (1999) 100 Statistical Tests Sage; Collado-Vides, et al. (1996) Integrative Approaches to Molecular Biology MIT Press; Adriaans and Zantinge (1996) Data Mining Addison-Wesley; and Everitt and Dunn (1991) Applied Multivariate Data Analysis Arnold.
The optimized signatures may be in humans or non-human species, but will often be essentially surrogate markers for the phenotype. Where the signatures in the model systems have not been directly demonstrated in human systems, validation must be performed. However, given systems biology analyses and genomic data, the optimization might be performed in a non-human or quasi-human context. Further studies necessary for conversion of those signatures into the corresponding human systems may be minimal.
However, the methodology may also work backwards to establish that certain experimental systems can be directly relevant to humans with accepted surrogate markers. When it is established that an experimental system is diagnostic of whole organism human phenotype, the experimental model then can be used to test candidate therapeutic treatments or entities. The “experimental” feature then allows one to test new clinical candidates, rather than being limited to using approved entities in humans for determining phenotype, e.g., therapeutic response.
The experimental systems may be, e.g., in vitro or in vivo, and can be genetic, developmental, physiological, or other systems useful as models of disease or conditions. The models may be based upon the correlation of the optimized signatures back to the similar signatures detected in humans in a whole organism context, where the various functional or structural systems are intact and interacting. This can lead to surrogate markers or signatures applicable to experimental systems, but which are linked to intact human outcomes.
Additional input will derive from computational modeling (e.g., combining chemical structure with biological outcome; e.g., Leadscope technology, Multicase, DEREK, and the like), which will include data mining, structural alert determination, statistical correlations, fragmentation of chemical structures, expert rules, and the like.
Thus, cell lines or systems may be used, including alternative species, or human cell lines. The cell lines or systems may be human, transformed, transfected, or modified to exhibit features characteristic of the human phenotype, including features of human disease or pathological conditions. The cell lines or systems, including derivatives from stem cells, will generally be designed to provide readable signatures, as identified using the processes described, which can provide useful correlation back into the intact human systems. And the curve fitting component of the model building inherently relates back onto intact human data, with medical records and clinical input. Consensus biomarkers can be selected from the lists of markers from the various tables, datasets, and subsets. Particular markers can be selected which are either conserved across different datasets, or are relevant to common mechanisms of toxicity manifesting symptoms among multiple organ systems. For example, certain liver markers evaluate cell types found in the liver, e.g., PBMC, mucosa, or other cell types found also in other sites. Thus, conservation of markers may reflect that (a) similar pathways operate in different target organs and/or (b) evaluating markers in different sites may actually be evaluating cells which are commonly found in both sites. Similarities across samples, e.g., in physiology, function, structure, gene expression, and/or developmental origin (e.g., ectoderm, mesoderm, and endoderm), should often reflect similarities in metabolic pathways and potential for mischief. Thus, gene expression similarities will typically cause similarities in biochemistry and toxicology responses.
For example, highly vascularized tissues (which may include liver, kidney, and many immune organs such as the spleen, thymus, bone marrow, lymph nodes) will typically show significant overlap in gene expression and physiological responses to contributing vascular components such as blood, vessels, and related mucosa. Likewise, quickly growing tissues such as bone marrow, immune organs, gastrointestinal tract, and skin will commonly express cell division related pathway components. Thus, biomarkers relevant to one organ would often be expected to be relevant to other organs sharing similar physiology, function, or gene expression of relevant or related pathways.
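The selection of consensus biomarkers conserved across datasets can be sketched in a few lines. The following is only an illustrative sketch: the function name `consensus_markers`, the `min_datasets` cutoff, and the marker names are hypothetical placeholders, not markers or parameters taken from the tables or datasets described herein.

```python
from collections import Counter

def consensus_markers(marker_sets, min_datasets=2):
    """Return markers that appear in at least `min_datasets` of the given sets."""
    counts = Counter()
    for markers in marker_sets.values():
        counts.update(set(markers))  # count each marker once per dataset
    return sorted(m for m, n in counts.items() if n >= min_datasets)

# Hypothetical marker lists per organ dataset; names are placeholders only.
datasets = {
    "liver": {"GENE_A", "GENE_B", "GENE_C"},
    "kidney": {"GENE_B", "GENE_C", "GENE_D"},
    "bone_marrow": {"GENE_C", "GENE_E"},
}

shared = consensus_markers(datasets, min_datasets=2)
print(shared)  # markers seen in two or more organ datasets
```

Raising `min_datasets` narrows the list toward markers conserved across all organs, while a value of 2 keeps markers shared by any pair of organs.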
Once signatures are identified, as described, those signatures may be applied to organs or subsystems. The subsystems may include, among many, ex vivo organ or system studies, in vivo non-human organ models, cell lines or collections thereof, e.g., whose physiological or biological outcomes may simulate the range of population diversity, robotic or parallel assay methods, including “laboratory on a chip” systems for testing parameters (as identified within the signatures), and other means to test the range of responses to treatments.
The result of developing experimental models with surrogate markers and endpoints will be systems which allow for testing, screening, and accurate outcome prediction of high cost clinical studies from simpler and better correlated models. Instead of spending tens of millions of dollars at later stages of drug candidate development, reliable experimental feedback at earlier stages will allow prioritization of competing clinical candidates for the few clinical development program slots. And reliable feedback early in the process will increase the success rate of candidates entering into the pipeline.
Moreover, with the development of classifier marker signatures, evaluation of test candidates for phenotypic outcome can be more easily performed. This may be applicable to testing new intervention therapeutics, or to develop more targeted toxins, e.g., in targeting tumor physiology. General environmental toxicity may be addressed using primate or other species.
Beyond the capability of using the experimental models for evaluating phenotype, such as toxicity, the models may be developed with specific disease or medical conditions as targets. The impact from the disease state will also be incorporated into the system so that the readouts are taken in the context of the biology and physiology existing in the clinical condition. There will be enormous advantages in performing the assays in the models simulating the context of the desired target biology.
Thus, where there are genetic or physiological models of the clinical condition, the biomarkers will be useful both in helping to determine the relevance of the proposed models and in evaluating such models to determine what treatments have positive effects on the “surrogate” markers. Various technologies may be utilized to evaluate where and when various metabolic systems are important. Imaging methods may be developed to evaluate signatures internally at selected sites to monitor or otherwise identify either adverse responses or to monitor disease development or progression.
It will be recognized that similar methods may be used to focus on mechanisms of specific diseases or conditions, and are not necessarily limited to application to liver toxicity mechanisms. Thus, the methods may be used to study defined phenotypes, whether classified together by medical practice, or clustered together by molecular definition of condition. The ultimate goal is molecular definition of conditions, distinguishing different mechanisms leading to similar symptoms, and reaching the possibility of personalized treatment of defined conditions.
Fixed time point diagnostic products should result from identification of markers, e.g., from non-humans, using genomic data and systems biology to point to human gene, protein, and metabolomic counterparts. Extension from single markers to multiple combination markers, e.g., including evaluation of attenuating factors derived from further genetic correlations, medical records data, non-genetic markers, disease or medical condition factors, behavioral or life style factors, and other correlations, will lead to better quality and higher accuracy diagnostic packages. Development of hypotheses in non-human species, along with access to human population phenotype or outcome endpoints, will assist in generating hypotheses which will lead to statistically efficient and cost effective validation for diagnostic products and methods. Human studies will also allow identification of new biomarkers and products derived therefrom.
Dynamic (multiple timepoint) monitoring should result from recognition of the dynamic signatures culminating in phenotype, both immediate and distant future events. This understanding should allow for continuous monitoring diagnostic devices and methods to track the progression of risk development over time. This, combined with the combinatorial analysis, should provide important new perspectives on how diagnostic strategies and patient monitoring will be performed.
Methods and constructs utilizing the biomarkers and pathway and network information will be incorporated into experimental systems. These will be incorporated into in vitro and other assays for testing treatments. Classifier or substitute measures for phenotype will result, along with better refined measures coming from focus on critical, robust, and combinatorial signatures. Dynamic factors will be better understood, and predictive methodologies will be developed.
With the complex combinatorial and dynamic biomarkers identified, as described, adoption into models of disease or condition can be better monitored for signatures correlated to phenotype. Thus, surrogate signatures may be validated and become acceptable means to use experimental subsystems, in vitro systems, or in vivo components for acceptable phenotypic readouts. Such experimental systems may often incorporate human components, e.g., genes, regulatory structures, etc., or be based upon human systems, organs, tissues, or the like, with other components from animals. Mechanical systems may be incorporated to evaluate titrations over time or concentration. High throughput screening or testing systems will evaluate optimized surrogate signatures to provide useful information on human response, or to identify dangers to carefully monitor in the whole animal or human organism context.
Alternative systems include animal cell lines or systems as models for animal testing. Certain ones may incorporate human components, e.g., genes identified as critical points, to evaluate factors in human biomarker interaction. Certain animal disease models can incorporate human features, and ultimately human disease models may be generated, e.g., based on genomics and systems biology. Counterpart human tissue systems or in vitro cell systems may be combinatorially combined to develop information on the behavior of the human systems of relevance to the human disease or medical condition. These systems, alone or in combination, will lead to signatures useful for diagnosis, monitoring, or surrogate readouts for phenotype.
Besides screening or testing, these systems will also be useful for evaluation of therapeutic index of treatments. The treatments may be tested in combinations, thereby providing useful insights into combination drug interactions. As many current drug problems result from peculiar interactions of multiple drugs, these systems provide experimental means to evaluate or model, with some statistics, outcome phenotypes. These will be early attempts at providing useful experimental models and biomarkers useful to model disease situations.
These means for evaluation of phenotype, e.g., system interactions, should lead to applications directed to monitoring individuals undergoing treatment. Thus, knowing the genetic factors contributing to a particular phenotype may allow subsetting of patients or potential patients (categorizing them into subsets) into those exhibiting lower or higher risk from treatment, and even prediction of the timing of onset of problems. This will be useful in the therapeutic context in determining what alternative treatments would be indicated, when they should be applied, and/or when danger from the primary treatment has subsided so that return from an alternative is safe.
In addition, the present invention provides means to evaluate or prioritize early drug candidates for clinical success. Early determination of risk for efficacy and toxicity of a drug candidate is important in prioritizing resources for pipeline drug development. Because of the enormous costs of preclinical and clinical testing, any reliable information early on can be critical in early decisions on which candidates to continue and which to terminate, or prioritization among alternatives. Because late termination is so costly, both economically and in lost time, many drug development programs are abandoned after failure of the top (first) candidate. Second or subsequent candidates often never progress even to the preclinical testing required for market approval, and thus are eliminated from the potential pharmacopeia. Thus, reliable indications of outcome of these tests can raise or lower a particular candidate among alternatives in a development program. The present invention will allow this.
Moreover, with the more robust diagnostic signatures indicating phenotype, or outcome, the invention provides means to rescue drugs at risk for market withdrawal. If accurate and reliable diagnostic signatures can be identified which subset patients or potential patients into low and high risk sets for treatment, adverse drug events may again become rare and idiosyncratic situations. Rescue from market withdrawal can often result from the capability to identify target groups. Less expensive testing of combinations of drugs and more information on the mechanisms of drug adverse events will result in better understanding of how different individuals respond to particular treatment regimens.
Computer systems are important in being able to handle and analyze the enormous amounts of information, and to process and summarize the results. The present invention begins with the means and strategy to identify the likely candidates for large scale genome evaluation. By narrowing the search from some 30K human genes down to a small fraction (0.2-2%) for defined organ toxicity and/or mechanisms, the task of looking for appropriate features corresponding to those markers is dramatically decreased. The computer means to do the correlation have been described in detail, here and elsewhere. Many textbooks and the patent literature describe those in some detail.
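As a rough illustration of the scale of that narrowing, the arithmetic can be written out directly. This is a sketch only; the function name `candidate_gene_range` is hypothetical, and the figures are the approximate ones stated above (some 30K genes, 0.2-2%).

```python
def candidate_gene_range(total_genes, low_fraction, high_fraction):
    """Return the (low, high) number of genes left after narrowing to a fraction."""
    low = round(total_genes * low_fraction)
    high = round(total_genes * high_fraction)
    return low, high

# Figures from the text: roughly 30,000 human genes, narrowed to 0.2-2%.
low, high = candidate_gene_range(30_000, 0.002, 0.02)
print(f"{low} to {high} candidate genes")
```

Under these figures the search space falls from tens of thousands of genes to somewhere between several dozen and several hundred per organ toxicity or mechanism.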
Specific forms of toxicity are approached, looking for targets or pathways relevant to specific mechanisms focused, e.g., on liver, muscle, neurologic, bone marrow, gastrointestinal, kidney, skin, immune system, etc. Within each of these mechanisms, the relevant genetic markers are identified. Storing these data into files on a computer then can direct where and how to look for each type of toxicity. With a catalog of the different forms and mechanisms of toxicity, and a map of the relevant classifier biomarkers, the system then provides enormous power to catalog the modes of where to look within the organismic system for the earmarks of the toxicity pathway from initiation, progression, and outcome.
Computer systems applicable to biological applications are described, e.g., in Skierczynski and Schonk U.S. Pat. No. 6,934,636; Nahum and Stanislaw U.S. Pat. No. 6,695,780; Otvos U.S. Pat. No. 6,653,140; Singh U.S. Pat. No. 6,560,541; Stults, et al. US Pat App 20060027744; Hall and Gordon US Pat App 20040253215; and Heller, et al. US Pat App 20040236603, each incorporated herein by reference. Particularly relevant are computer systems which can evaluate features which allow subsetting patients into various risk groups, e.g., using means to determine personal risk with various therapeutic alternatives. Often some features may include medical record data, genetic data, historic medical information, and additional diagnostic features.
Computer systems will incorporate or utilize the files which catalog and link forms or pathways of toxicity, e.g., with data underlying the classifier biomarkers. Scanning through the classifier biomarker sets, common biomarkers which can indicate toxicity in various organs or locations can be identified, leading to selection of features or parameters which can evaluate the status of toxicity pathways across a wide range of locations of biological samples. Samples from different organs or locations may be evaluated on a common evaluation platform to simplify testing.
Other circumstances may require continuous monitoring of particular features. Dynamic patterns of features may show earmarks of lack of toxic effect, initiation, progression, and unavoidable toxic response. Dynamic monitoring may allow identification of when symptoms will become serious, and when certain therapeutic interventions must be substituted or changed as progression approaches irreversibility. Alternatively, pathway progression may be blocked by therapeutic intervention, e.g., with another approved drug, or known intervention (diet, other treatment). Computer systems to identify what to evaluate and when will either contain files which point out critical correlations, or are based on programs inherently using such information.
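The idea of reading a dynamic pattern for the four stages named above can be sketched as a simple threshold classifier over a time series. Everything specific here is an assumption made for illustration: a single marker, the cut points in `THRESHOLDS`, the arbitrary units, and the sample readings are hypothetical, and a practical system would likely use multi-marker signatures rather than one level.

```python
from bisect import bisect_right

# Stage names follow the text; the thresholds are assumed cut points in
# arbitrary units, not measured values.
STAGES = ["no toxic effect", "initiation", "progression", "unavoidable toxic response"]
THRESHOLDS = [1.0, 2.0, 4.0]

def stage_for(level, thresholds=THRESHOLDS):
    """Map a marker level to a stage by counting how many thresholds it has reached."""
    return STAGES[bisect_right(thresholds, level)]

def first_time_at_or_beyond(series, stage):
    """Return the first time point whose stage is at or beyond `stage`, else None."""
    target = STAGES.index(stage)
    for t, level in series:
        if STAGES.index(stage_for(level)) >= target:
            return t
    return None

# Hypothetical (day, marker level) readings for one monitored patient.
readings = [(0, 0.4), (7, 0.9), (14, 1.6), (21, 2.5), (28, 3.1)]
for day, level in readings:
    print(day, stage_for(level))
print(first_time_at_or_beyond(readings, "progression"))
```

In this sketch, the first time point at or beyond "progression" would be the signal to consider substituting or changing the intervention before the final stage is reached.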
Thus, the invention provides files which identify or list relevant biomarkers linked to the specific toxicity mechanisms studied. These files will be incorporated into computer systems, directly or indirectly, through software. The patterns of genotypes, gene expression, protein expression, protein modification, post-translational modification, RNA features, and the like will be contained in similar files.
With identified classifier biomarkers, whether based on SNPs, other genetic elements, or other features of expression or function, there should be commercial opportunities for diagnostics based thereon. Diagnostic products, services, and related commercial opportunities will result when the underlying genetic or physiological bases of toxicity are understood. Knowing where, when, and how to look can tell who may experience various categories of risk. Specific testing may subset target patients into those who are more or less likely to respond negatively to a particular therapy or drug regimen.
Also, recognizing appropriate markers indicative of toxicity mechanisms will allow use in therapeutic or drug development efforts, e.g., as diagnostic complements to subset patients by efficacy, risk, or toxicity pathway activation. Monitoring over time will allow recognition of the features where the toxicity pathway will erupt. Test systems evaluating identified markers which exhibit high correlation to safety in intact systems may be useful to test compounds early in development efforts. Prioritizing compounds in development before they are administered to humans will allow upgrading of candidates early, and accurate toxicity evaluation long before expensive clinical testing is reached. Moreover, test systems which approximate disease states will result, leading to better predictive systems for both toxicity testing in the context of background disease and test platforms to determine response to relatively rarely occurring clinical situations, e.g., combinations of treatments or response to combination drug therapies.
Understanding mechanisms and/or pathways of toxicity initiation or progression presents the possibility of intervening to block the response. Particular therapeutic means to block ADR progression, e.g., by increasing a clearance mechanism, inducing a bypass mechanism to shunt toxic entities (whether primary compound or metabolically toxified), or blocking a secondary target, may be identified. Alternatively, monitoring the progression in a patient may allow safer continuing administration before termination, e.g., allowing less frequent substitution of alternative therapy or the like.
In addition, better understanding of the pathways of toxicity will provide better insights into the pathways of disease. Similar strategies to understand toxicity pathways will be applicable to understand disease pathways. Methodologies can be developed to analyze toxicity or other pathways, combining (1) the genetic correlation to combinations of diploid haplotype or allelic biomarkers, often in collections of classifier biomarkers; (2) systems biology understanding of the pathways and alternative entities to bypass or continue physiological functions, and recognizing where in the organism (which organs) and the features of biomarkers to evaluate; (3) evaluating dynamic patterns which will be useful to identify earmarks of absence, initiation, progression, or past status of the pathway; and (4) using homogeneous genetic populations with medical records or large sample banks allowing selection of phenotypically homogeneous collections for analysis.
Dosing regimens, or combination drug dosings, may be evaluated or monitored. Threshold toxic levels of combination treatments can be established, monitored, or identified, allowing combination therapies to affect a common target, but having sub-threshold negative effects. Timing aspects of pharmacology may be much better defined and carefully monitored individually, as relevant to specific patients.
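One simple way to express a sub-threshold check for a combination is to sum each drug's dose as a fraction of its own toxic threshold. This is a sketch under a dose-addition assumption, which is not stated in the text and would not hold for drugs that interact synergistically or antagonistically; the drug names, doses, thresholds, and function names are all hypothetical.

```python
def combined_toxic_fraction(doses, single_agent_thresholds):
    """Sum each drug's dose as a fraction of its own toxic threshold
    (assumes simple dose addition)."""
    return sum(doses[d] / single_agent_thresholds[d] for d in doses)

def is_sub_threshold(doses, single_agent_thresholds, limit=1.0):
    """True when the combined fraction stays below the assumed limit."""
    return combined_toxic_fraction(doses, single_agent_thresholds) < limit

# Hypothetical single-agent toxic thresholds and two candidate regimens (same units).
thresholds = {"drug_a": 100.0, "drug_b": 40.0}
regimen_low = {"drug_a": 30.0, "drug_b": 20.0}
regimen_high = {"drug_a": 60.0, "drug_b": 20.0}

print(is_sub_threshold(regimen_low, thresholds))
print(is_sub_threshold(regimen_high, thresholds))
```

Per-patient thresholds could be substituted for the population values to reflect the individual monitoring described above.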
Computer simulation may allow prediction of toxicity response in humans, as computer models today allow aerodynamic design formerly requiring wind tunnel tests.
While much of the discussion herein refers to human therapeutic targets, the same applications will be easily used in the context of veterinary treatment. Thus, the methods will not be limited to human analyses, but will be applicable to other groups, e.g., mammals, primates, species typically used in clinical testing, e.g., rats, mice, dogs, cats, chimpanzees and other primates and subprimates; to various types of animal functions, e.g., companion (dogs, cats, rabbits, etc.), food (birds, goats, sheep, cows, pigs, snakes, etc.), work (elephants, camels, ox, llamas, horses, dogs, etc.), and show animals (horses, aquatic animals, etc.); to structural categories, e.g., quadrupeds, bipeds, flying animals, aquatic animals; to particular subsets of species including standard experimental species from fungi (including neurospora, yeast), prokaryotes, protozoa (e.g., malaria, trypanosomes, etc.), in plants, insects (flies, water flea, pests), worms (nematodes, segmented or otherwise), invertebrates or vertebrates, and other creatures. In particular, some applications of the invention to “pests” might be to find more toxic substances which have little or no effect on other species.
In particular, the present invention is directed to various methods, both for analyses and for diagnosis. It is intended that the invention encompass methods where one or more steps are performed outside of the jurisdiction of a country where information is gathered, analyzed, or used, or where treatment decisions are made. For this reason, methods where the information is communicated to persons within a legal jurisdiction are described, including where the persons are a patient, health care professional (human or veterinary), health care insurer or auditor, or drug marketing or regulatory agency. The information may be transmitted, e.g., in written, oral, or coded forms, or in analog, digital, or encrypted forms.
In addition, devices designed for use in these methods are also encompassed by the invention. Thus, the cell lines, systems, and the like used in these analyses are incorporated, as are kits and diagnostic systems used in manual, automated, or robotic systems. Preferably, the systems will provide results rapidly, reproducibly, and with minimal manual handling, e.g., which will minimize variability and promote diagnostic validation.
Having now generally described the invention, the same will be more readily understood through reference to the following examples which are provided by way of illustration, and are not intended to be limiting of the present invention, unless specified.
Methods for genetic analysis are well known, especially in the era of microarray analysis of genes. Specific evaluation of the full intact sequence of alleles is also common, which may include full sequencing, hybridization to selected probes, PCR analysis, and others. General methods of molecular biology are well known. See, e.g., Ausubel (ed.), et al. (2002) Short Protocols in Molecular Biology (5th ed.) Current Protocols, ISBN: 0471250929; Sambrook, et al. (2001) Molecular Cloning: A Laboratory Manual (vol. 1-3) CSH Lab. Pr. and affiliated www.MolecularCloning.com site that is evolving into an on-line laboratory manual; Cutler (ed. 2004) Protein Purification Protocols (Methods in Molecular Biology; 2d ed.) Humana Press, ISBN: 1588290670; Coligan, et al. (2001) Current Protocols in Protein Science Wiley, ISBN: 0471356808; Dickson and Mendenhall (eds. 2004) Signal Transduction Protocols (Methods in Molecular Biology; 2d ed.) Humana Press, ISBN: 1588292452; Waldman (ed. 2004) Genetic Recombination: Reviews and Protocols (Methods in Molecular Biology) Humana Press, ISBN: 1588292363; Schneider (ed. 2000) Chaperonin Protocols (Methods in Molecular Biology) Humana Press, ISBN: 0896037398; van de Heuvel (ed. 1997) PCR Protocols in Molecular Toxicology CRC-Press, ISBN: 084933344X; Fan (ed. 2002) Molecular Cytogenetics: Protocols and Applications (Methods in Molecular Biology) Humana Press, ISBN: 1588290069; Selinsky (ed. 2003) Membrane Protein Protocols: Expression, Purification, and Characterization (Methods in Molecular Biology) Humana Press, ISBN: 1588291243; Theophilus, et al. (2002) PCR Mutation Detection Protocols (Methods in Molecular Biology) Humana Press, ISBN: 0896036170; Wise (ed. 2002) Epithelial Cell Culture Protocols (Methods in Molecular Biology) Humana Press, ISBN: 0896038939; Brownstein and Khodursky (eds. 2003) Functional Genomics: Methods and Protocols (Methods in Molecular Biology) Humana Press, ISBN: 1588292916; Aguilar (ed. 2004) HPLC of Peptides and Proteins: Methods and Protocols (Methods in Molecular Biology) Humana Press, ISBN: 0896039773; Helfrich and Ralston (eds. 2003) Bone Research Protocols (Methods in Molecular Medicine) Humana Press, ISBN: 1588290441; Janzen (ed. 2002) High Throughput Screening: Methods and Protocols (Methods in Molecular Biology) Humana Press, ISBN: 0896038890; Killeen (ed. 2001) Molecular Pathology Protocols Humana Press, ISBN: 0896036812; Sioud (ed. 2004) Ribozymes and siRNA Protocols (Methods in Molecular Biology; 2d ed.) Humana Press, ISBN: 1588292266; and other volumes in the Humana Press Methods in Molecular Biology/Molecular Medicine series (see www.humanapress.com or BioMedProtocols.com); in the Methods in Enzymology series; in the periodical “Nature Methods”; or the like.
Biological samples are collected from appropriate subjects, e.g., animal or human. These may be human patients, e.g., persons exhibiting a phenotype, e.g., unfavorable drug effects. Conversely, the persons may be identified as persons experiencing no unfavorable drug effects, i.e., are low risk patients. Subsetting of patients into the classifications of unfavorable or lack of unfavorable drug effects will be useful, and the statistical analysis typically requires both in blinded analyses. Collection of associated medical data or the like is very useful, e.g., including behavioral, life style, and associated medical, disease, or treatment information. Samples are often banked as part of a clinical trial, and associated medical records can be of great annotation value.
Experimental animal subjects may be preferred for certain studies, as many fewer limitations exist for sampling. Animal sampling typically can be more invasive, is generally not limited as to type or amount, and will generally be less expensive. Human studies have limitations imposed by both ethical (type, amount, purpose, consent) and economic concerns.
The samplings may include one or more types of material, e.g., liquid, cellular, serum, tissue, hair, skin, fluid, etc., to evaluate genetics, expression, metabolism, or the like, of appropriate biomarkers. Samples will preferably be evaluated immediately, or may be preserved after appropriate treatment for later analysis, e.g., by freezing, fixation, or other preservation methods, consistent with the type of analysis to be applied. Animal samples may be easier to evaluate, archive, and re-evaluate for different parameters at a later date.
Preferably the target or sample population will be a homogeneous population, exhibiting low genetic diversity and minimal introduction of genetic diversity from outsiders. Statistical concerns should be recognized, so the statistical power of the study can provide useful conclusions. Genetic analysis of small numbers of individuals in such a population will point to specific biomarkers which will suggest pathways likely to be implicated in phenotype, e.g., therapeutic outcomes. But the correlations may be weak and indistinguishable from noise. Thus, large homogeneous populations linked to medical records and related data are particularly useful, e.g., the Icelandic or a similar population. Alternatively, selection of banked samples may be based upon similarity in phenotype or genotype.
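The statistical power concern can be made concrete with a standard normal-approximation sample size estimate for comparing a marker's frequency between two groups. This is a sketch only: the text does not prescribe this test, and the function name `samples_per_group`, the default significance level and power, and the example frequencies are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(p_cases, p_controls, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided z-test comparing
    two proportions (unpooled normal approximation)."""
    if p_cases == p_controls:
        raise ValueError("frequencies must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_cases * (1 - p_cases) + p_controls * (1 - p_controls)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_cases - p_controls) ** 2)

# Hypothetical marker frequencies in affected and unaffected subjects.
print(samples_per_group(0.30, 0.20))  # modest difference: larger groups needed
print(samples_per_group(0.40, 0.20))  # larger difference: smaller groups suffice
```

The required group size grows quickly as the frequency difference shrinks, which is why weak correlations in small populations are hard to distinguish from noise.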
Alternatively, animal studies may be used, which will be useful in identifying gene markers correlating to particular phenotypes. The relevance of animal studies to human phenotypes is a consideration in study design.
Datasets evaluating toxicity in various organs are accessible. The first is liver toxicity, studied in rat, mouse, and dog. Other organ toxicities of interest include muscle toxicity (fatigue, pain, and cardiovascular muscle problems), CNS, bone marrow (immune system and other effects), GI tract (which similarly has rapid cell replication), kidney (clearance function), skin (fast replication), and lung (enormous surface area).
Many diagnostic methods are useful for evaluating biological samples, preferably using Good Laboratory Practices. See, e.g., Nicoll, et al (2003) Pocket Guide to Diagnostic Tests (LANGE Clinical Science, 4th ed.) McGraw-Hill Medical, ISBN: 0071411844; Gallin (2002) Principles and Practice of Clinical Research Academic Press, ISBN: 0122740653; Daniels (2002) Delmar's Manual of Laboratory and Diagnostic Tests Thomson Delmar Learning, ISBN: 0766862356; Anderson (2002) GLP Essentials: A Concise Guide to Good Laboratory Practice (2d Ed.) CRC Press, ISBN: 1574911384; Weinberg (2002) Good Laboratory Practice Regulations (Drugs and the Pharmaceutical Sciences: a Series of Textbooks and Monographs, 3d ed.) Marcel Dekker, ISBN: 0824708911; Springhouse (2001) Clinical Laboratory Tests: Values and Implications (3d Ed.) Lippincott Williams and Wilkins, ISBN: 1582550816; Abraham, et al. (2004) Trends in Biotechnology 22:15-22; and Frank and Hargreaves (2003) Drug Discovery 2:566-580.
High density microarray evaluation of genes is quite attractive, e.g., using Affymetrix or similar technologies. See, e.g., Baldi, et al. (2002) DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling Cambridge University Press, ISBN: 0521800226; Simon, et al (2004) Design and Analysis of DNA Microarray Investigations Springer, ISBN: 0387001352; Knudsen (2004) Guide to Analysis of DNA Microarray Data (2d ed.) Wiley-Liss, ISBN: 0471656046; Speed (ed. 2003) Statistical Analysis of Gene Expression Microarray Data Chapman and Hall/CRC, ISBN: 1584883278; Draghici (2003) Data Analysis Tools for DNA Microarrays (Bk and CD-ROM) Chapman and Hall/CRC, ISBN: 1584883154; Schena (2002) Microarray Analysis Wiley-Liss, ISBN: 0471414433; and others. Additionally, lower density arrays or individual bead or PCR analyses platforms may be used.
In one embodiment of the invention, once genes of interest have been identified, PCR analysis is performed to determine the “universe” of corresponding alleles in the human population. The region of the alleles can be localized to relatively short segments of chromosomal sequence, perhaps some “few” kb in length. Alternatively, RNA analysis may be performed to evaluate RNA sequences with the introns spliced out. Identifying the specific alleles being expressed (among a known universe of possibilities) may also include “PCR type” amplification steps to reduce background noise. Appropriate primers are selected and used to address the relevant region of the genome. That region would be amplified, and the other portions of the genome fall out (reducing background noise). For example, selected primers may vary among up to 10-15 different specific allele sequences. Numerous dyes are available to determine which pairs out of the possibilities have been used (e.g., current FACS systems can distinguish over 10-15 different fluorescent wavelengths). Using primers which hybridize to each of the polymorphisms, differently labeled primers can be incorporated (or hybridized) to determine which of the 15 different primers have been incorporated. The presence of two different alleles forming a diploid pair is confirmed, for example, by assigning one set of primers to one allele and another set of primers to the other allele (e.g., primers (by wavelengths) 2, 6 and 14 would be assigned to one allele, while 1, 4 and 14 would be assigned to a second, and so on). Such “minisequencing” is described, for example, in Liljedahl, et al. (2004) BMC Biotechnology 4:24 (doi:10.1186/1472-6750-4-24); and Shi (2001) “Enabling large-scale pharmacogenetic studies by high-throughput mutation detection and genotyping technologies” Clin. Chem. 47:164-72.
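The assignment of labeled primer sets to alleles can be sketched as a lookup over a known universe of alleles. This is an illustrative sketch, not the assay's actual decoding procedure: the first two primer sets reuse the example wavelengths above, while `allele_3`, the allele names, and the function name are hypothetical. It also assumes that only the presence or absence of each label is read, not its intensity.

```python
from itertools import combinations_with_replacement

# Each allele is identified by the set of labeled primers (numbered by dye
# wavelength) that it incorporates. allele_3 is a made-up third entry.
ALLELE_PRIMERS = {
    "allele_1": frozenset({2, 6, 14}),
    "allele_2": frozenset({1, 4, 14}),
    "allele_3": frozenset({3, 6, 9}),
}

def diploid_candidates(observed_labels, allele_primers=ALLELE_PRIMERS):
    """Return every allele pair whose combined primer labels equal the observed set."""
    observed = frozenset(observed_labels)
    return [
        (a, b)
        for a, b in combinations_with_replacement(sorted(allele_primers), 2)
        if allele_primers[a] | allele_primers[b] == observed
    ]

print(diploid_candidates({1, 2, 4, 6, 14}))  # heterozygous example
print(diploid_candidates({2, 6, 14}))        # homozygous example
```

If more than one pair is returned, the observed labels are ambiguous for that allele universe and an additional discriminating primer would be needed.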
Exemplary datasets directed to similar problems have been reported, e.g., in Boess, et al. US Pat. App. 20040005547 “Biomarkers and expression profiles for toxicology”; Durham, et al. “Identification of biomarkers for liver toxicity” US Pat. App. 20040265889; Gut, et al. “Individual drug safety” US Pat. App. 20050037366.
Other methodologies may be applied, and may involve analyses which span multiple methodologies. See, e.g., Evans (2004) A Handbook of Bioanalysis and Drug Metabolism CRC Press, ISBN: 0415275199; Matson (2004) Applying Genomic and Proteomic Microarray Technology in Drug Discovery CRC Press; ISBN: 0849314690; and Albala and Humphery-Smith (2003) Protein Arrays, Biochips, and Proteomics Marcel Dekker, ISBN: 0824743121. Many techniques of analyses may be applied to proteomics or metabolomics, as described above.
Genetic analyses may include analyses of, e.g., quantitative DNA levels (genetic copy number; genetic duplication; genetic deletion; etc.), qualitative DNA features (polymorphisms; mutations; variations; regulatory features; and other features of structural or regulatory components), and other structural or functional DNA features (methylation, acetylation, other modifications or features). See, e.g., Fuchs and Podda (2004) Encyclopedia of Medical Genomics and Proteomics (2 vols.), Marcel Dekker, ISBN: 0824755618; Redei (2003) Encyclopedic Dictionary of Genetics, Genomics, and Proteomics (2d ed.), Wiley-Liss, ISBN: 0471268216; and Campbell and Heyer (2002) Discovering Genomics, Proteomics, and Bioinformatics (Bk and CD-Rom ed.) Benjamin Cummings, ISBN: 0805347224.
Haplotype analyses include, e.g., complete haplotype analyses in each individual (considering the entire complement of possible or related haplotypes or alleles exhibited across a population, including functionally related or other variant forms), analyses of genetic copy number and expression regulatory differences (gene duplications, gene amplification analyses), and particularly how specific haplotypes interact with other combinations of haplotypes or related haplotypes which affect biological function of a particular genotype. For example a “dominant” haplotype may be recessive to multiple copies of a “recessive” haplotype, and certain forms of gene dosage effects may depend upon gene copy numbers (e.g., in chromosome duplications) or regulatory segments controlling an allele. In particular, the phenotypic result from peculiar combinations of alleles may not comport with the simplistic Mendelian model of an inherent “dominance” or “recessiveness” of specific alleles. And in circumstances where kinetics are critical, e.g., in metabolic pathways, the flux of reactants may be affected by the reaction or turnover rates of the relevant (source/sink) enzymes, and the ultimate accumulation of particular reactants may depend upon the relative expression levels or turnover numbers of the respective producing or reacting enzymes. In many cases, problems in the turnover number of a particular expressed allele may be compensated by over/under expression of a different upstream or downstream enzymatic function, by activity of a modulating effector, or by the compensating expression of a different allele in the upstream or downstream functionality. Moreover, the presence of a different related (e.g., overlapping) enzymatic function may affect the impact of a specific allele. 
Thus, correlation with the whole spectrum of related functions will be valuable. Simple genetic correlation to individual alleles will often fail to provide optimum correlation, and the resulting correlation may often be too poor to effectively assist in therapeutic decision making.
Expression analyses, typically related to mRNA molecules, include, e.g., expression regulation (transcription; mRNA turnover; mRNA lifetime; translation effectors; and splice variants (many of which may exhibit different phenotypic function)). PCR or related methodologies may be used to qualitatively define and quantitate specific allelic forms.
Evaluation of proteins, e.g., by proteomic analyses, will often distinguish between forms which can exhibit different phenotypes, e.g., functional differences. Variants in sequence, amount, modification (glycosylation, phosphorylation, etc.), cofactors, subunit interactions (especially in multiprotein complexes), and such can result in differences in activity, function, or other phenotype.
Many methodologies to evaluate functional biological entities have been described, e.g., Walker (2005) Proteomics Protocols Handbook Humana Press, ISBN: 1588293432; Hamdan and Righetti (2005) Proteomics Today: Protein Assessment and Biomarkers Using Mass Spectrometry, 2D Electrophoresis, and Microarray Technology (Wiley-Interscience Series on Mass Spectrometry), Wiley-Interscience, ISBN: 0471648175; Simpson (2004) Purifying Proteins for Proteomics: A Laboratory Manual CSH Laboratory Press, ISBN: 0879696966; Cheng and Hammar (eds. 2004) Conformational Proteomics of Macromolecular Architecture: Approaching the Structure of Large Molecular Assemblies and Their Mechanisms of Action (with CD-Rom), World Sci. Pub., ISBN: 9812386157; Sanchez (ed. 2004) Biomedical Applications of Proteomics Wiley, ISBN: 3527308075; Twyman (2004) Principles Of Proteomics (Advanced Text Series), BIOS Sci. Pub., ISBN: 1859962734; Conn (2003) Handbook of Proteomic Methods Humana Press, ISBN: 1588293408; Westermeier, et al (2002) Proteomics in Practice: A Laboratory Manual of Proteome Analysis Wiley-VCH, ISBN: 3527303545; Simpson (2002) Proteins and Proteomics: A Laboratory Manual CSH Laboratory Press, ISBN: 0879695544; and Liebler (2001) Introduction to Proteomics: Tools for the New Biology Humana Press, ISBN: 0896039919.
For example, certain modifications (methylation, glycosylation, phosphorylation, acetylation, ubiquitination, and many others) to a protein may affect enzymatic turnover number, biological activity, half-life, turnover, substrate or other (e.g., regulatory) specificity or selectivity, temperature or other environmental sensitivity, enhancer or suppressor efficiency, and many other features which affect ultimate biological function in a particular context. See, e.g., Lee, et al. (2003) Drug Metabolism Enzymes Marcel Dekker, ISBN: 0824742931; and Testa and Mayer (2003) Hydrolysis in Drug and Prodrug Metabolism: Chemistry, Biochemistry, and Enzymology Wiley-VCH, ISBN: 390639025X. Chaperone proteins and protein conformation or folding scaffolds may be important, and often the functional unit will be other than specific genes or gene products. For example, ribosome function probably is dramatically affected by the many protein and RNA components which contribute to its structure.
In situ hybridization methods may evaluate the fidelity of organelle or cellular compartment localization, trafficking integrity and efficiency, and other gross biological functions which may sensitively distinguish or mediate various allelic differences.
Other methodologies, whether direct or indirect, may be used as surrogate means to evaluate particular features of the biological functions, e.g., PET scans, gas chromatography, and other non-invasive or invasive spectroscopic or other methods which evaluate biochemical features useful herein. See, e.g., Korfmacher (2005) Using Mass Spectrometry for Drug Metabolism Studies CRC Press, ISBN: 0849319633; and Pfleger, et al (2000) Mass Spectral and GC Data of Drugs, Poisons, Pesticides, Pollutants and Their Metabolites (2d Ed.) Wiley-VCH, ISBN: 3527288805.
Accessible samples will often be most amenable to genetic analyses, since samples may exist historically or in archival forms. Blood samples are reasonably common and easily collected. DNA and related methodologies, e.g., PCR, hybridization, sequencing, restriction analyses, and the like, are typically “qualitative”, reasonably sensitive, and reasonably unambiguous in analysis. Moreover, sample handling tends to be simple and reproducible. Samples may be collected by non-invasive methods, e.g., hair/fingernail/skin/mucosal samples. Other non-invasive techniques such as X-ray, MRI, or related imaging methods; stool/urine/saliva/mucous samples; reproductive fluids; tears; exhalation; and external analytical methods may be used.
Statistical correlation analysis of phenotype with the results of these analyses will identify, in rank order, those markers exhibiting statistical correlation. The analyses can also be extended to correlate with haplotype combinations, where specific haplotypes or alleles are each evaluated. Simplistically, assuming haplotypes or alleles are only pairwise (e.g., diploid only), correlations with specific pairs can be evaluated for phenotype. Extending this further, correlations should include alternative combinations, including situations where one chromosome (or part thereof) may be duplicated, where gene dosage or dramatic regulatory effects may be evaluated, or where functionally equivalent alternative genetic sites may affect penetrance to phenotype.
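The rank ordering of markers by correlation with phenotype can be sketched as follows. This is an illustrative example only, assuming a simple Pearson correlation on toy data; the marker names, values, and the helpers `pearson` and `rank_markers` are hypothetical, and other statistical measures may be substituted.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_markers(marker_values, phenotype):
    """Return (marker, r) pairs sorted by decreasing absolute correlation."""
    scores = {name: pearson(vals, phenotype) for name, vals in marker_values.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data (hypothetical): one phenotype score and one value per marker
# for each of six individuals.
phenotype = [0, 0, 1, 1, 1, 0]
markers = {
    "marker_1": [0, 0, 1, 1, 1, 0],  # tracks the phenotype
    "marker_2": [1, 0, 1, 0, 1, 0],  # weakly related
    "marker_3": [1, 1, 0, 0, 0, 1],  # inversely related
}
ranked = rank_markers(markers, phenotype)
```

Under the diploid-only simplification, an indicator for a specific allele pair (1 if the individual carries that pair, 0 otherwise) can be supplied as an additional marker column and ranked in the same way.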
In many situations, the haplotypes or markers will obviously indicate specific metabolic or enzymatic pathways or networks which correlate to phenotype. Alternatively, various pathways will emerge as being critical, and the members of the pathways or networks can be evaluated more closely.
In the analysis, certain patterns will be identified which account for most of the genetic variations contained in the target population (e.g., experiencing the particular effect). For example, in the context of genetic allele analysis for many genes, preferably the entire structural genome, the presence or absence of allele pairings is evaluated. The evaluation may take many forms, but the principal forms include Principal Component Analysis (PCA), various clustering techniques, supervised clustering techniques, and other statistical methods referred to above. These data will provide information which can be combined with systems biology and genomic cross species correlations to understand what networks and what members of these networks are likely targets to be useful signature factors. These factors will be those which are directly correlated to the features, typically combinations, which together define the phenotype.
Applying the Gene Ontology software on rat liver toxicity markers from Boess, et al. US Pat. App. 20040005547 “Biomarkers and expression profiles for toxicology” provides the results of Table 1.
Table 2A lists gene ID numbers for primate counterpart biomarker genes derived from the dataset from the Boess, et al. patent. Human and chimpanzee counterparts are identified, and other species can be similarly listed where sufficient information is available on the genome. The human subset 1 provides counterpart human genes, expressly listing the Entrez ID, the accepted symbol, and a short description of the gene corresponding to the marker. See http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene, which database is described, e.g., in Maglott, et al. (2005) Nucl. Acids Res. 33: database issue doi:10.1093/nar/gki031. The human subset 2 are Entrez gene IDs of genes which are reported to interact directly (e.g., by physical association or 2-hybrid interaction) with markers of subset 1, either reported from human or other species counterpart. The human subset 3 are Entrez gene IDs of markers which have been associated by being referred to in a published abstract with one of the markers of subset 1.
Similarly, chimpanzee counterparts are listed by Entrez gene ID numbers.
Table 2B lists Entrez gene ID numbers for counterparts in selected non-primate species. These are provided in dog, rat, and mouse. Similar counterparts can be generated for other species, as the genome sequences of additional species become more complete and counterpart equivalents can be determined. Typically the counterparts are assigned by sequence relatedness, genetic location in closely related species, or functional equivalence.
Additional datasets generated in rat, mouse, and dog have been similarly evaluated. These include data on TCDD (dioxin) toxicity in mouse liver reported in Boverhof, et al. (2005) “Temporal and dose-dependent hepatic gene expression patterns in mice provide new insights into TCDD-Mediated hepatotoxicity” Toxicol. Sci. 85:1048-1063; on acetaminophen (APAP) toxicity (hl) in rat liver reported in Heinloth, et al. (2004) “Gene expression profiling of rat livers reveals indicators of potential adverse effects” Toxicol. Sci. 80:193-202; on a number of compounds (ez) including dimethylnitrosamine, 2-nitrofluorene, aflatoxin B1, and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone in rat liver reported in Ellinger-Ziegelbauer, et al. (2004) “Characteristic expression profiles induced by genotoxic carcinogens in rat liver” Toxicol. Sci. 77:19-34; and lipopolysaccharide (LPS) toxicity (hg) in dog liver reported in Higgins, et al. (2003) “Gene expression analysis of the acute phase response using a canine microarray” Toxicol. Sci. 74:470-484.
Table 3A lists Entrez gene ID numbers for primate counterpart biomarker genes derived from these datasets based on liver toxicity studies. Human and chimpanzee counterparts are identified, and other species can be similarly listed where sufficient information is available on the genome. The human subset 1 provides counterpart human genes, expressly listing the Entrez gene ID, the accepted symbol, and a short description of the gene corresponding to the marker. The primate subset 2 are Entrez gene IDs of markers which are reported to interact directly (e.g., by physical association or 2-hybrid interaction) with markers of subset 1, either reported from human or another species counterpart. The primate subset 3 are Entrez gene IDs of markers which have been associated by being referenced in a published abstract with one of the markers of subset 1.
Table 3B lists Entrez gene ID numbers for counterparts in selected non-primate species. These are provided in dog, rat, and mouse. Similar counterparts can be generated for other species, as their genome sequences become more complete and counterpart equivalents can be determined.
Table 4 lists markers which are found from analysis of multiple datasets. Independent studies commonly point to these markers as important, although their relevance is not limited only to liver toxicity. Similar analyses can be performed for the additional markers, including those in Tables 5A/B and 6A/B.
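One simple way to find markers supported by multiple datasets is to count, for each marker, the number of datasets in which it appears. The following is a sketch under stated assumptions, not the procedure actually used to generate Table 4: the dataset codes follow the labels used in the text (hl, ez, hg), while the numeric gene IDs are placeholders rather than real Entrez gene IDs, and the helper `recurrent_markers` is hypothetical.

```python
from collections import Counter


def recurrent_markers(datasets, min_datasets=2):
    """Return sorted marker IDs that appear in at least `min_datasets` datasets."""
    counts = Counter()
    for marker_ids in datasets.values():
        # set() ensures a marker is counted once per dataset.
        counts.update(set(marker_ids))
    return sorted(m for m, n in counts.items() if n >= min_datasets)


# Placeholder IDs for illustration only; not real Entrez gene IDs.
datasets = {
    "hl": {101, 102, 103},
    "ez": {102, 103, 104},
    "hg": {103, 105},
}
shared = recurrent_markers(datasets)
in_all_three = recurrent_markers(datasets, min_datasets=3)
```

Raising `min_datasets` yields a smaller and more stringently supported list.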
Particularly preferred discriminatory markers are those which are listed in the various lists of Table 4, including combinations thereof, for evaluating or testing toxicity pathways. These are human markers which different independent studies commonly indicate are relevant discriminatory markers in liver toxicity. Primate counterparts will also be important, as the primate test systems are most likely relevant to clinical trials.
Other preferred embodiments are the Table 3A subset 2 and 3 for primate markers from each of the respective studies. The markers may be selected individually or in combinations within or across the studies (e.g., one or more from different datasets). Combinations of one or more of these markers can be generated from others from other tables. The Table 2A subsets 2 and 3 are similarly preferred, being primate markers. Individual markers, or combinations comprising markers from Table 2A subset 1 and Table 3A subset 1 are of interest, including combinations across various table subsets. Combinations including one or more markers from, e.g., the Table 3A subsets 2 and 3 with a member of Table 3A subset 1, or Table 2A or 2B, are also valuable.
Markers from non-primate species may also be important, e.g., Tables 2B and 3B. Individual or combinations within or across the subsets 2 and 3 are preferred. Combinations across the subsets 1 also have value.
Additional datasets generated in rat and mouse have been similarly evaluated, without limiting them as markers only to liver toxicity. These include data on furan toxicity (f) in rat liver reported in Hamadeh, et al. (2004) “Integration of clinical and gene expression endpoints to explore furan-mediated hepatotoxicity” Mutat. Res. 549:169-183 (PMID 15120969); methapyrilene toxicity (mp) in rat liver reported in Hamadeh, et al. (2002) “Methapyrilene toxicity: anchorage of pathologic observations to gene expression alterations” Toxicol. Pathol. 30:470-482 (PMID 1218793); various carcinogen compounds (i) on mouse liver reported in Iida, et al. (2005) “Unique patterns of gene expression changes in liver after treatment of mice for 2 weeks with different known carcinogens and non-carcinogens” Carcinogenesis 26:689-699 (PMID 15618236); and paclitaxel compounds (1) on rat liver reported in Lee, et al. (2004) “cDNA microarray gene expression analysis and toxicological phenotype for anticancer drug” J. Vet. Med. Sci. 66:1339-1345 (PMID 15585946). Data is presented in Tables 5A/B and 6A/B as described above, analogous to Tables 2A/B and 3A/B, with various preferred embodiments derived from subsets 3 and 2, and 1.
Additional datasets have been identified and analyzed, besides the liver toxicity datasets described above. In particular, these include data on mouse aortic vascular smooth muscle cells (VSMC) in culture exposed to benzo[a]pyrene (BaP), a polycyclic aromatic hydrocarbon present in tobacco smoke inducing oxidative stress (j), which is reported in Johnson, et al. (2003) “Genomic profiles and predictive biological networks in oxidant-induced atherogenesis” Physiol. Genomics 13:263-275 (PMID: 12657712); data on cardiotoxicity responses to exposure to ditoxin, doxorubicin, isoproterenol, and/or LPS (cd) reported in Hirakawa, et al. (2005) “Method for determining cardiotoxicity” US Pat App 20050138675; and a cardio polymorphisms dataset (cpm) reported in Cargill, et al. (2005) “Genetic polymorphisms associated with cardiovascular disorders and drug response, methods of detection and uses thereof” US Pat App 20050272054. A neurotoxicity dataset (dm) from sarin treatment of rats is reported in Damodaran, et al. (2006) “Gene expression profiles of the rat brain both immediately and 3 months following acute sarin exposure” Biochem. Pharmacol. 71:497-520 (PMID: 16376859). Two rat kidney toxicity datasets have been included, responses to treatment with sevoflurane products (k) reported by Kharasch, et al. (2006) “Gene expression profiling of nephrotoxicity from the sevoflurane degradation product fluoromethyl-2,2-difluoro-1-(trifluoromethyl)vinyl ether (“Compound A”) in rats” Toxicol. Sci. 90:419-431 (PMID 16384817); and with ochratoxin A (u) reported by Luhe, et al. (2003) “A new approach to studying ochratoxin A (OTA)-induced nephrotoxicity: expression profiling in vivo and in vitro employing cDNA microarrays” Toxicol. Sci. 73:315-328 (PMID 12700408). Analysis is included of human PBMC response to cigarette smoke condensate compounds (vl) as reported in van Leeuwen, et al. (2005) “Differential gene expression in human peripheral blood mononuclear cells induced by cigarette smoke and its constituents” Toxicol. Sci. 86:200-210 (PMID 15829617). Human keratinocyte culture response to various compounds (ae) is also analyzed, as reported by Bae, et al. (2003) “Gene expression patterns as potential molecular biomarkers for malignant transformation in human keratinocytes treated with MNNG, arsenic, or a metal mixture” Toxicol. Sci. 74:32-42 (PMID 12773770). And response of rat lung is analyzed (d) based on data reported by Dillman, et al. (2005) “Genomic analysis of rodent pulmonary tissue following bis-(2-chloroethyl) sulfide exposure” Chem Res Toxicol. 18:28-34 (PMID: 15651846).
These data are presented in Tables 5A/B and 6A/B, as in Tables 2A/B and 3A/B. Markers described in these tables, individual or in combinations, and including members of the various subsets, are embodiments of the invention. The markers are not limited to toxicity in the organ, cell type, or pathway where identified, e.g., the markers may be valuable in toxicity in other organs, cell types, or pathways.
Table 5A lists Entrez gene ID numbers for primate counterpart biomarker genes derived from these datasets based on toxicity studies. Human and chimpanzee counterparts are identified, and other species can be similarly listed where sufficient information is available on the genome. The human subset 1 provides counterpart human genes, expressly listing the Entrez gene ID, the accepted symbol, and a short description of the gene corresponding to the marker. The primate subset 2 are Entrez gene IDs of markers which are reported to interact directly (e.g., by physical association or 2-hybrid interaction) with markers of subset 1, either reported from human or another species counterpart. The primate subset 3 are Entrez gene IDs of markers which have been associated by being referenced in a published abstract with one of the markers of subset 1.
Table 5B lists Entrez gene ID numbers for counterparts in selected non-primate species. These are provided in dog, rat, and mouse. Similar counterparts can be generated for other species, as their genome sequences become more complete and counterpart equivalents can be determined.
Other preferred embodiments are the Table 5A subset 2 and 3 for primate markers from each of the respective studies. The markers may be selected individually or in combinations within or across the studies (e.g., one or more from different datasets). Combinations of one or more of these markers can be generated from others from other tables. The Table 5A subsets 2 and 3 are similarly preferred, being primate markers. Individual markers, or combinations comprising markers from Table 5A subset 1 and Table 6A subset 1 are of interest, including combinations across various table subsets. Combinations including one or more markers from, e.g., the Table 3A or 6A subsets 2 and 3 with a member of Tables 3A or 6A subset 1, or Table 2A or 2B, are also valuable. Combinations of markers from various Tables and subsets are incorporated herein.
Markers from non-primate species may also be important, e.g., Tables 2B, 3B, 5B and 6B. Individual or combinations within or across the subsets 2 and 3 are preferred. Combinations across the subsets 1 also have value.
Genomic profiling platforms have allowed collecting huge quantities of data from individual samples. With whole genome expression profiling, measurements of expression levels of most genes can be generated, providing, e.g., some 30K data points per sample. However, the interpretation of the expression data to a physiological outcome or response is difficult to achieve. If, e.g., only 10 organs were involved in a specific outcome, then some 30K genes over 10 organs would generate some 300K data points, which “reduce” down to a few alternative outcomes. The reduction of a 300K dimensional space down to 2-25 outcomes requires some dramatic pattern recognition capability.
The mathematical transformation from a 30K or 300K dimensional vector space down to the “critical” 3-5 outcome patterns is possible through mathematics allowing identification of “eigenvectors”. Determining mathematically which markers are involved in the characteristic patterns is a first step. The goal is to identify what handful of “surrogate markers” can reproducibly and reliably correlate with the desired outcomes.
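The eigenvector approach can be illustrated on a toy scale. The following minimal sketch assumes a small samples-by-markers matrix with invented values, and finds the leading eigenvector of the marker covariance matrix by power iteration; the helper names are hypothetical. For datasets of 30K or more dimensions, a numerical library (e.g., a singular value decomposition routine) would be used in practice rather than this pure-Python form.

```python
import math


def center(rows):
    """Subtract the per-marker (column) mean from every sample (row)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (markers x markers) of centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration: returns (eigenvalue, unit eigenvector) of largest magnitude."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))  # Rayleigh quotient
    return eigenvalue, v


# Toy data (hypothetical): 4 samples x 3 markers. Markers 0 and 1 vary
# together across samples; marker 2 is nearly constant.
samples = [
    [1.0, 2.0, 5.0],
    [2.0, 4.1, 5.1],
    [3.0, 5.9, 4.9],
    [4.0, 8.0, 5.0],
]
eigenvalue, loadings = leading_eigenvector(covariance(center(samples)))
# Markers ordered by the size of their contribution to the first pattern.
ranked = sorted(range(len(loadings)), key=lambda j: abs(loadings[j]), reverse=True)
```

Markers with the largest absolute loadings on the leading eigenvectors are candidates for the handful of surrogate markers; because loadings depend on measurement scale, the data are commonly standardized first.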
The data may be gene expression patterns, dynamic feature patterns, proteomic feature patterns, metabolomic measurement patterns, SNP evaluation patterns, or allele combination patterns. An advantage of using genetic based analyses is the lack of necessity to determine specifically where to evaluate the sample. Thus, genetic analysis will point to likely candidate markers whose number is sufficiently small to enormously reduce the complexity. With a much smaller universe of genes to investigate, coupled with large amounts of information on expression patterns, locations, disease contexts, systems biology, and the like, the difficulties of where, when, and how to look for the surrogate markers are greatly reduced.
The methodology to apply such to gene expression array analysis is described, e.g., in Najarian U.S. Pat. No. 6,996,476; Saffer, et al. U.S. Pat. No. 6,990,238; Parks and Moore U.S. Pat. No. 6,954,722; Lipshutz, et al. U.S. Pat. No. 6,953,663, and references contained therein, each of which is incorporated herein by reference. Outside of the context of gene expression, per se, e.g., in genotype, haplotype, or other high density characterization or based on other analysis means (e.g., proteomic or metabolomic analyses), earlier general mathematical methods analogous to those cited may be based on, e.g., Jackson (2003) A User's Guide to Principal Components Wiley-Interscience, ISBN 0471471348; Jolliffe (2002) Principal Component Analysis (2d Ed.) Springer, ISBN 0387954422; Diamantaras and Kung (1996) Principal Component Neural Networks: Theory and Applications Wiley-Interscience, ISBN 0471054364; Pearson (1901) “On Lines and Planes of Closest Fit to Systems of Points in Space” in The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science vol. 2:559-572, Sixth Series, July-December 1901; Hotelling (1933) “Analysis of a Complex of Statistical Variables into Principal Components” in The Journal of Educational Psychology 26:417-441 and 26:498-520; and Hotelling (1936) “Simplified Calculation of Principal Components” in Psychometrika 1:27-35.
Two basic categories of dimensionality reduction analyses are the supervised and unsupervised forms. See, e.g., Balakrishnama and Ganapathiraju “Linear Discriminant Analysis: A Brief Tutorial” from the Department of Electrical and Computer Engineering, Institute for Signal and Information Processing, Mississippi State University. There are many algorithms for either method. Principal Component Analysis is an example of an unsupervised method, i.e., no assumptions are made regarding how many “patterns” are contained in the data. In contrast, the linear classifier (or linear discriminant) method is an example of supervised dimensionality reduction. For example, in a cancer compared to normal context, there are a priori categories (cancer, normal) and the effort is to identify a minimum set of genes or features that would discriminate between the two.
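The supervised case can be illustrated with a two-class Fisher linear discriminant. The sketch below is restricted to two features so that the within-class scatter matrix can be inverted in closed form; the feature values, the class assignments, and the helper names are all hypothetical, and the midpoint threshold is one simple choice among several.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]


def scatter(rows, mean):
    """2x2 within-class scatter matrix: sum of outer products of deviations."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_discriminant(class0, class1):
    """Return (w, threshold) with w = Sw^-1 (m1 - m0) and a midpoint threshold."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    project = lambda x: w[0] * x[0] + w[1] * x[1]
    return w, 0.5 * (project(m0) + project(m1))


def classify(sample, w, threshold):
    """Return 1 (second class) if the projection exceeds the threshold, else 0."""
    return 1 if w[0] * sample[0] + w[1] * sample[1] > threshold else 0


# Toy data (hypothetical): feature 0 separates the classes; feature 1 does not.
normal = [[1.0, 5.0], [1.2, 4.0], [0.8, 6.0], [1.1, 5.5]]
cancer = [[3.0, 5.2], [3.3, 4.1], [2.9, 6.1], [3.1, 5.0]]
w, threshold = fisher_discriminant(normal, cancer)
predictions = [classify(s, w, threshold) for s in normal + cancer]
```

In contrast to the unsupervised case, the class labels enter the computation directly, and the features carrying the largest weights (after accounting for measurement scale) are the candidates for a minimum discriminating set.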
The complexity of large dimensionality datasets from substantially genome-wide analysis overwhelms the data analysis capabilities of many studies. Thus, the conclusions from these studies often fail to recognize much of the information resulting from the study. In particular, upon proper analysis of these forms of experimentation, many additional lessons can be derived. In particular, the use and application of the methods of the present invention to existing data can lead to specific hypotheses which may be easily tested. Alternatively, the reduction in complexity of the system can allow more practical and modest testing procedures to focus more efficiently on the important and relevant markers.
In fact, animal models often fail to correlate to human disease. Thus, the identification of animal markers often fails to transfer to surrogate markers in humans. However, with more sophisticated data analysis and systems understanding, the relationship between different genes can be much better predicted. Even so, reducing the complexity and scale of screening or testing will be very important.
Screening for biomarkers in a complex system often fails to identify critical subsets of markers. In some cases, the parameter of analysis is not relevant. For example, RNA levels may not be relevant if the activity of the protein product is regulated by other than transcriptional means, e.g., post-translational phosphorylation regulation. Alternatively, measurable parameters, e.g., expression levels of specific tRNAs, often may be only marginally relevant or irrelevant to a particular physiological state. In other cases, the manifestation of symptoms may not correspond to the initiation of the physiological cause. For instance, a metabolic deficiency may manifest itself in varied and different symptoms, or may manifest itself in different forms depending upon the individual.
Thus, a screening strategy which does not assume a cause casts a wider net for causes than a focused study based on unfounded assumptions. Genetic analysis can often identify markers which in complex systems otherwise would never have been presumed to have effects. Moreover, once a genetic component is identified, that element may lead to identification of related elements, whether the interaction is physical or functional. Entities which modulate the effect of that initial element are candidates for further study, and for use as alternative diagnostic markers, surrogate markers, or even targets for therapy.
Identification of relevant markers can focus efforts to screen for appropriate aspects of those markers. Screening a limited number of markers for the correct modality for assay, e.g., genetic allele pairings, complement of related overlapping functions, RNA transcription, protein expression, post-translational features, and the like can be readily performed. In particular, dynamic aspects of expression or functional activity can be monitored. See, e.g., Troyanskaya (2005) Brief Bioinform. 6:34-43; Abraham, et al. (2004) “High content screening applied to large-scale cell biology” Trends Biotechnol. 22:15-22; Sauer (2004) Curr. Opin. Biotechnol. 15:58-63; Daub, et al. (2003) Bioinformatics 19:2332-343; Nikitin, et al. (2003) Bioinformatics 19:2155-167; and similar publications.
An example of related analysis occurs in genetic studies. Selectable traits are correlated with particular identifiable genes or haplotypes. However, a converse problem can be conceived where the selectable trait has some correlation with the presence or absence of a haplotype, but the “penetrance” of the haplotype is less than striking. Penetrance is described, e.g., in Houlston and Peto (2004) Oncogene 23:6471-6476; Hirschhorn, et al. (2002) Genetics in Medicine 4:45-61; Human Genome Program, U.S. Department of Energy (2003) “Genomics and Its Impact on Science and Society: A 2003 Primer”. In such a situation, additional genetic features which affect penetrance, e.g., other “accompanying” haplotypes or genes, can be identified using the correct tools. See, e.g., Terwilliger and Weiss (1998) Curr. Op. Biotechn. 9:578-594. The penetrance might be correlated with “combinations” of genetic markers rather than with single markers. The present invention helps to suggest the forms of those accompanying markers.
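The dependence of penetrance on combinations of markers can be illustrated with a simple calculation, taking penetrance as the fraction of carriers of a given marker set who exhibit the trait. The cohort, the marker names H1 and H2, and the helper `penetrance` below are hypothetical and serve only as a sketch.

```python
def penetrance(individuals, required_markers):
    """Fraction of carriers of all `required_markers` who show the trait.

    Returns None when the cohort contains no carriers.
    """
    required = set(required_markers)
    carriers = [p for p in individuals if required <= p["markers"]]
    if not carriers:
        return None
    return sum(1 for p in carriers if p["trait"]) / len(carriers)


# Hypothetical cohort: each individual has a set of markers and a trait status.
cohort = [
    {"markers": {"H1", "H2"}, "trait": True},
    {"markers": {"H1", "H2"}, "trait": True},
    {"markers": {"H1"}, "trait": False},
    {"markers": {"H1"}, "trait": False},
    {"markers": {"H1", "H2"}, "trait": True},
    {"markers": {"H2"}, "trait": False},
    {"markers": set(), "trait": False},
    {"markers": {"H1"}, "trait": True},
]
single = penetrance(cohort, {"H1"})          # H1 alone: 4 of 6 carriers
combined = penetrance(cohort, {"H1", "H2"})  # H1 with H2: 3 of 3 carriers
```

In this toy cohort the trait tracks the combination of H1 with the accompanying marker H2 more closely than it tracks H1 alone, which is the pattern the combination analysis is intended to detect.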
Alternatively, where enormous amounts of data have been generated, one can select and scrutinize particular markers for those which provide the most advantageous features. Thus, features which make diagnostic evaluation more reliable will be preferred over less reliable features. These features include, among others, commercially relevant considerations on accuracy, reliability, minimum frequencies of false results, whether positive or negative, cost, speed, ease of evaluation, etc.
Regulatory concerns may also be important, e.g., speed and cost of regulatory approval, cost effectiveness, and others.
A combination of these two methodologies, using a smaller homogeneous population for initial marker identification, with scrutiny of selected fractions, e.g., less than 80, 70, 60, 50, 40, or 30% of markers, will provide efficiencies in the screening.
A first correlation step will often be to identify genetic markers which correlate with a particular outcome, e.g., efficacy, safety, toxicity, ADR. The various common haplotype or allele combinations can be evaluated to see which combinations are more problematic than others, e.g., whether various specific allele combinations correlate with greater or lesser response. With the identification of the markers, pathway analysis will point to other markers which are likely to similarly affect the main toxicity mechanism pathways. Those alternative markers will often also be tested to determine whether they also correlate with various outcomes, and alternatives may be screened for favorable diagnostic or clinical characteristics.
Attractive features for preferred biomarkers include, e.g., (1) stable background across individuals, time, and conditions; (2) high signal with low noise; (3) easy sample accessibility (preferably minimally invasive, perhaps blood, urine, tear, or surface sampling); (4) accurate, simple, rapid platform for analysis; (5) high statistical accuracy and correlation with outcome; (6) high reproducibility in measurements; (7) inexpensive and simple analytical methods.
Knowledge databases linking data within a species or across species will be useful. There are many such databases, e.g., products similar to those provided by Ingenuity, Entelos, BioVista, Jubilant BioSystems, ARIADNE Genomics, and the like. Similar offerings should be available from the NCBI website and others. See, e.g., http://www.genomicglossaries.com/content/lifesciences_databasesdirectory.asp, which refers to various specific databases, e.g., Metabolic Pathways of Biochemistry (www.gwu.edu), Kyoto Encyclopedia of Genes and Genomes (KEGG, www.genome.jp/kegg), ExPASy (www.expasy.ch), Enzymes and Metabolic Pathways Project (www.empproject.com), MetaCyc encyclopedia of metabolic pathways (metacyc.org), Alliance for Cellular Signaling (AfCS, www.cellularsignaling.org), BioPathways Consortium (www.biopathways.org), and similar sites.
From haplotypes which are correlated with defined pathways and networks, other members of those networks will be identified and their relevance to the pathway can be similarly inferred. From here, different strategies will be applied, depending partly on the difficulty of phenotype readout. Some phenotypes may be directly observed in experimental systems, either in vivo, or in vitro or subsystems. Other phenotypes will be analyzed to determine either surrogate markers, or similar signatures of important physiological or other precursor indicators of phenotype.
These further analyses may take many forms, e.g., directed to metabolic evaluations (metabolomic endpoints), structural endpoints (determining whether structural components of the respective subcellular or organ structure have effects), regulatory effects (e.g., using RNAi, antisense, genetic targeting, or other means to affect gene expression levels), gene mutational analyses, multiple endpoint analyses over time or other parameters, solvent effects, and others. These will provide a simple means of determining which effects are direct, indirect, or highly correlative with phenotype. Many different combinations may be used, and may provide alternative parameters with multidimensional readouts. This will generally provide a more reliable and stable diagnostic approach.
Perturbation analyses can be performed to identify particular nodes in the networks which are especially sensitive or stable to perturbation. See, e.g., Ramanathan, et al. (2005) “Perturbational Profiling of a Cell Line Model of Tumorigenesis by using Metabolic Measurements” Proc. Nat'l Acad. Sci. USA 102:5992-5997. The perturbation analyses described herein can be applied to signatures incorporating different parameters of measurement.
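One simple way to rank network nodes by their sensitivity to perturbation is sketched below. It is an illustrative assumption, not the method of the cited reference: each node is scored by the mean absolute log2 fold change of its readout across a set of perturbations, and the node names and measurement values are hypothetical.

```python
from math import log2


def perturbation_sensitivity(baseline, perturbed):
    """Rank nodes from most to least sensitive to perturbation.

    baseline:  {node: unperturbed readout (> 0)}
    perturbed: {node: [readout under each perturbation (> 0)]}
    Score = mean absolute log2 fold change relative to baseline.
    """
    scores = {}
    for node, readings in perturbed.items():
        base = baseline[node]
        scores[node] = sum(abs(log2(r / base)) for r in readings) / len(readings)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical readouts for two nodes under two perturbations each.
baseline = {"node_1": 10.0, "node_2": 10.0}
perturbed = {
    "node_1": [20.0, 5.0],   # doubles, then halves: sensitive
    "node_2": [10.5, 9.8],   # barely moves: stable
}

ranking = perturbation_sensitivity(baseline, perturbed)
for node, score in ranking:
    print(node, round(score, 3))
```

Nodes at the top of such a ranking are candidates for especially sensitive features, and nodes at the bottom for stable ones.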
Correlation of biomarkers to phenotype occurs in various contexts. In particular, surrogate biomarkers are used as diagnostic of a phenotype, and may be useful, e.g., in in vitro systems (which correlate with the in vivo whole organism context), or in model systems to predict phenotype in a whole animal. See, e.g., McKim, et al. (2005) “A Biochemical Approach to In Vitro Toxicity Testing” Pharmaceutical Discovery, January 2005. An approach is described which successfully bridged the gap between cells and whole-animal effects by focusing on key functional endpoints that are not organ- or species-specific. The capability has been validated in animal studies. Certain biomarkers or endpoints may be used to predict cross-species results.
In the present invention, curve fitting of signatures back to phenotype of human response to administered drug (adverse drug response) is based on human responses. Experimental systems can be evaluated for surrogate markers which fit the human response outcome. Thus, cross-species questions as to whether results in a non-human species are relevant to humans are minimized.
Test systems may include in vitro systems and in vivo systems (animal models). These often will incorporate genetic features which increase susceptibility to, or simulate, a disease or physiological context. The systems may be used to test or screen chemical or biologic compounds for response with appropriate classifier markers, or for status of a toxicity pathway. Where the classifier markers are highly correlative with human response, the test system has high reliability. Thus, the test system may use a different species background with selective components of human markers, or a human background with specific allelic combinations correlative with individuals in the human population. Moreover, the diversity of responses may be evaluated in the test system.
Conversely, test systems may have poor correlation with outcome in the intact organism. In such circumstances, the curve fitting to find features or appropriate parameters for accurate prediction should relate to outcome in the intact physiological context. Thus, using primate, and preferably human subjects, will allow determination of the genetic markers which promote penetrance.
Upon initial identification of correlating biomarkers, different signatures can be derived therefrom. For example, once identified in one species, it is likely that counterpart structural or functional entities can be found in closely related species. Systems biology and/or genomic data can be used to identify those counterparts, or other experimental methods may be applied to find counterparts in different species.
It will also often be true that functions closely related, including upstream or downstream functionalities, will also have statistical correlation with phenotype, and may move away from or towards critical checkpoints. Thus, it will be useful, once biomarkers are identified, to specifically check whether functionally adjacent biomarkers will show similar or greater correlation.
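The check for functionally adjacent biomarkers can be sketched as follows. The sketch assumes, for illustration only, that adjacency is available as a simple mapping from a biomarker to its upstream or downstream neighbors and that correlation with phenotype is measured by the Pearson coefficient; the biomarker names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def neighbors_matching_seed(seed, network, levels, phenotype):
    """Return the seed's |r| with phenotype and those neighbors whose
    |r| is at least as large."""
    seed_r = abs(pearson(levels[seed], phenotype))
    hits = {}
    for neighbor in network.get(seed, []):
        r = abs(pearson(levels[neighbor], phenotype))
        if r >= seed_r:
            hits[neighbor] = r
    return seed_r, hits


# Hypothetical data: five subjects, a graded phenotype, three biomarkers.
phenotype = [1, 2, 3, 4, 5]
levels = {
    "marker_A": [1, 2, 3, 5, 4],    # initially identified biomarker
    "marker_B": [2, 4, 6, 8, 10],   # downstream neighbor
    "marker_C": [5, 1, 4, 2, 3],    # upstream neighbor
}
network = {"marker_A": ["marker_B", "marker_C"]}

seed_r, hits = neighbors_matching_seed("marker_A", network, levels, phenotype)
print(round(seed_r, 3), {k: round(v, 3) for k, v in hits.items()})
```

A neighbor returned by such a check may lie closer to a critical checkpoint than the biomarker first identified.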
In addition, by looking at haplotype or allelic pair statistics, it should be possible to identify what pairings may attenuate or otherwise affect phenotype. Where multiple entities (e.g., different allelic variants) have similar or overlapping reaction specificities (e.g., for a substrate), it will be useful to see whether the phenotype also is dependent upon co-expression with related entities. In the situation of multisubunit complexes, the spectrum of alternative combinations of the respective components should be important, and explicitly investigated. For standard haplotype or allele combinations, some pairings may exhibit greater or lesser penetrance, and, e.g., phenotype should correlate more directly with combinations of paired alleles than with either allele alone.
When important signatures are identified, the dynamics of interactions or of metabolic conversions should be evaluated (e.g., metabolic fingerprints). The timing signature of the various biomarkers or metabolites should be carefully evaluated to identify temporal features which can be important in appearance of phenotype. Recognizing such features may also allow for dynamic monitoring of phenotype emergence.
The signatures may be in the intact organism, or may be model systems, e.g., in situ, in vitro, or subsystems. Once the appropriate network is identified, components thereof are identified as features to be evaluated for signatures. These features (specific species counterparts or variants) may also be targeted, e.g., for perturbation analysis, by substituting or incorporating various allelic forms into cells or systems as described, by transformation, deletion, etc. Allelic variants may take many forms, e.g., incorporating specific changes characteristic of natural variants, regulatory changes, enzymatic activity modulators, and the like. Such cell lines have been generated directed to cytochrome P450 pathway alleles or variants, e.g., incorporated into hepatocytes. See, e.g., Miners, et al. (2004) Ann. Rev. Pharmacol. Toxicol. 44:1-25; Rushmore and Kong (2002) Curr. Drug Metab. 3:481-490; Arora and Iversen (2001) Curr. Opin. Mol. Ther. 3:249-257; and Goldstein and de Morais (1994) Pharmacogenetics 4:285-299.
Often the most relevant systems may be subsystems or cell lines placed in the context of a relevant disease model. The cell lines or systems may be genetically modified, e.g., incorporating different genes, differently regulated genes, combinations of human or relevant allelic forms, and the like. Human or other stem cells or their derivatives, e.g., differentiated organ systems, may be useful. The systems may include physiologically or developmentally affected cells, cell lines, organs, tissues, or the like. The systems may also incorporate technologies relating to “laboratory on a chip” microfluidics or highly parallel or robotic systems. Individual chips, cells, lines, or systems may represent different individuals whose variability may be analogous to populational variability. Series of cells or systems may incorporate ranges of expression levels, or attenuating, suppressing, competing, or selective agents, and the like, to evaluate ranges of responses to such perturbations.
A predictive model may be built that combines several of these “patterns”, and can be used experimentally to predict from the preclinical model the phenotype which will result upon actual administration to an intact animal. This is a “curve fitting” exercise, which correlates the pattern features to the effect endpoints, which can be a human population (e.g., in an ADR correlation). This can be used to validate a model, and help establish that the model does, using the identified signatures, correlate with human outcomes.
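As a simplified illustration of such curve fitting, the sketch below fits a straight line by ordinary least squares between a single composite signature score measured in a preclinical model and an observed human response, and reports the coefficient of determination as a rough measure of fit. Reducing several patterns to one composite score, the choice of a linear form, and all data values are assumptions made for illustration only.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r_squared


def predict(slope, intercept, score):
    """Predicted response for a new signature score."""
    return slope * score + intercept


# Hypothetical data: composite signature score in the model system versus
# an observed human response measure for four reference compounds.
signature_scores = [1.0, 2.0, 3.0, 4.0]
human_response = [2.1, 3.9, 6.2, 7.8]

slope, intercept, r_squared = fit_line(signature_scores, human_response)
print(round(slope, 3), round(intercept, 3), round(r_squared, 4))
print(round(predict(slope, intercept, 2.5), 3))
```

A high coefficient of determination on compounds with known human outcomes would support, but not by itself establish, that the model correlates with human outcomes.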
Testing of compounds which have been used historically will typically be simple in terms of in vitro or small animal testing. Similarly, where a history of use in humans has existed and records or tracking of those persons has been performed, e.g., in an ADR report, the human outcomes are known. These may be in ADR databases, or archived in clinical trials. Thus, defined populations with such data can be extremely useful to evaluate genetically, as described above, and by the methods described.
New clinical candidates, e.g., new chemical entities (NCEs), or other new compounds can be tested in the experimental systems described to get accurate, low cost readouts on important clinical outcomes. In early stages of program development, these experimental systems may be used to prioritize candidates, predict preclinical or clinical outcomes, or determine when medicinal chemistry efforts have accumulated sufficient candidates of acceptable properties. The systems described herein may allow better informed decision making for early stage research efforts. Certainly, advances made in early decision making will minimize waste from later expensive testing which merely confirms unsuitability for clinical development.
Various methods have been used to predict whole animal toxicity effects using in vitro experimental systems. See, e.g., McKim, et al. (January 2005) Pharmaceutical Discovery pp 30-36; (see www.Ceetox.com).
In a rat system, in vitro methods and measured parameters after exposure to various selected compounds have been correlated with outcomes from in vivo animal studies. These parameters have then been used to predict the outcome of administration of similar compounds to whole animal exposure.
In cellular in vitro assays, typically selected cell types are seeded into microtiter plates and allowed to grow and equilibrate for a period of time. Certain cells may be selected for increased or decreased function of relevant biomarkers, e.g., transmembrane pumps, cytochromes, important metabolizing enzymes, etc. Various mutants may be evaluated, including cells lacking or overexpressing particular functions, transformed with multiple copies of specific variants or alleles, or important combinations of functions, e.g., increased intake with decreased excretion.
The period for equilibration typically ranges from 12-48 hrs, depending, e.g., upon cell type, density, and other factors. Thereafter, the test compounds are applied to the cells across broad concentration ranges to near solubility limits, typically 2-4 logs, and generally across 1-300 microM. Negative and positive controls are included.
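A concentration series spanning the stated range can be laid out with evenly spaced steps on a logarithmic scale. The following is a minimal sketch; the default of six points and the function name are assumptions for illustration, and the 1 to 300 microM endpoints are taken from the range given above.

```python
from math import log10


def log_spaced_concentrations(low_um=1.0, high_um=300.0, points=6):
    """Concentrations (microM) evenly spaced on a log10 scale,
    including both endpoints."""
    if points < 2:
        raise ValueError("need at least two points")
    if low_um <= 0 or high_um <= low_um:
        raise ValueError("require 0 < low_um < high_um")
    step = (log10(high_um) - log10(low_um)) / (points - 1)
    return [10 ** (log10(low_um) + i * step) for i in range(points)]


series = log_spaced_concentrations()
span_in_logs = log10(series[-1] / series[0])

print([round(c, 2) for c in series])
print(round(span_in_logs, 2))  # number of log units covered
```

A range of 1 to 300 microM covers roughly two and a half logs, which falls within the 2-4 log span described.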
Following such exposure, the cells and supernatants are evaluated, typically for general cellular processes. These typically include some or all of membrane integrity, mitochondrial function, cell proliferation, glutathione biochemistry, membrane lipid peroxidation, apoptosis, DNA integrity, transporter function, absorption or excretion activity, and general biological functions. Further pharmacology evaluation will typically include ADME, e.g., solubility measurement in the delivery medium, Pgp interaction, comparison of toxicity relative to related chemicals (structural or functional), subcellular targets of toxicity, and some component of structure-toxicity relationships. Other effects to evaluate include enzyme induction, metabolic activation, endocrine disruption, lipidosis, etc.
Various diagnostic markers derived from the methods described herein will be useful as surrogate markers. These can be applied in specific model systems as signatures of phenotypic outcomes. When the signatures are accepted, e.g., from biological reasoning or by experimental demonstration, the phenotypes may be inferred from the appropriate signature. The models will then be useful as early screening tools for evaluating phenotype.
The predictions may be directed to subsetting potential patients into groups with predetermined likelihood of efficacious therapeutic treatment. Thus, e.g., some set of diagnostic features might subset target individuals into defined groups where there is high, questionable, or low likelihood of a specific outcome (phenotype). While this may reflect better definition of the clinical problem (e.g., distinguishing between various “forms” of a syndrome or of a “group” of similarly defined symptomatic conditions), it may also provide for subsetting into groups likely to respond to a proposed therapeutic regimen.
Similarly, the predictions may be directed to subsetting potential patients into groups with predetermined likelihood of a particular effect of therapeutic treatment. Thus, e.g., some diagnostic signature might subset target individuals into defined groups where there is high, questionable, or low likelihood of specific outcome, particularly a positive or negative result, be it direct or indirect. This will particularly be valuable in the context of recognizing genetic or other predispositions to adverse side effects or intolerance to a particular proposed treatment. These factors may even be incorporated into the label of particular therapeutic entities or procedures.
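Subsetting into groups of high, questionable, or low likelihood can be illustrated with a simple thresholding step applied to a predicted likelihood score. The cutoffs of 0.3 and 0.7, the patient identifiers, and the scores below are hypothetical and are chosen only to show the mechanics; suitable cutoffs would depend on the signature and the outcome in question.

```python
def likelihood_group(score, low_cutoff=0.3, high_cutoff=0.7):
    """Assign a likelihood score in [0, 1] to 'high', 'questionable', or 'low'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score >= high_cutoff:
        return "high"
    if score < low_cutoff:
        return "low"
    return "questionable"


def subset_patients(scores, low_cutoff=0.3, high_cutoff=0.7):
    """Group patient identifiers by likelihood of the specified outcome."""
    groups = {"high": [], "questionable": [], "low": []}
    for patient_id, score in scores.items():
        groups[likelihood_group(score, low_cutoff, high_cutoff)].append(patient_id)
    return groups


# Hypothetical predicted likelihoods of an adverse response.
predicted = {"P1": 0.92, "P2": 0.55, "P3": 0.10, "P4": 0.70}

groups = subset_patients(predicted)
print(groups)
```

The same step applies whether the predicted outcome is efficacy or an adverse effect; only the interpretation of each group changes.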
All references cited herein are incorporated herein by reference to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.
Having now fully described this invention, it will be appreciated by those skilled in the art that the same can be performed within a wide range of equivalent parameters, amounts, and conditions without departing from the spirit and scope of the invention and without undue experimentation. While this invention has been described in connection with specific embodiments thereof, it will be understood that it is capable of further modifications. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains and as may be applied to the essential features hereinbefore set forth.
The specific embodiments described herein are offered by way of example, and the invention is to be limited only by the terms of the appended claims, along with the full scope of equivalents to which such claims are entitled.
This application claims priority to U.S. Provisional Application Ser. No. 60/778,133, filed Mar. 1, 2006 and U.S. Provisional Application Ser. No. 60/675,741, filed Apr. 27, 2005. All the above cited United States provisional applications are expressly incorporated herein by reference in their entirety.
Number | Date | Country
---|---|---
60778133 | Mar 2006 | US
60675741 | Apr 2005 | US

| | Number | Date | Country |
|---|---|---|---|
| Parent | 11380388 | Apr 2006 | US |
| Child | 12572960 | | US |