Leukemia is the most common childhood malignancy in the United States. Approximately 3,500 cases of acute leukemia are diagnosed each year in the U.S. in children less than 20 years of age. The large majority (>70%) of these cases are acute lymphoblastic leukemias (ALL) and the remainder acute myeloid leukemias (AML). The outcome for children with ALL has improved dramatically over the past three decades, but despite significant progress in treatment, 25% of children with ALL develop recurrent disease. Conversely, another 25% of children who now receive dose intensification are likely “over-treated” and may well be cured using less intensive regimens resulting in fewer toxicities and long term side effects. Thus, a major challenge for the treatment of children with ALL in the next decade is to improve and refine ALL diagnosis and risk classification schemes in order to precisely tailor therapeutic approaches to the biology of the tumor and the genotype of the host.
Leukemia in the first 12 months of life (referred to as infant leukemia) is extremely rare in the United States, with about 150 infants diagnosed each year. There are several clinical and genetic factors that distinguish infant leukemia from acute leukemias that occur in older children. First, while the percentage of acute lymphoblastic leukemia (ALL) cases is far more frequent (approximately five times) than acute myeloid leukemia in children from ages 1-15 years, the frequency of ALL and AML in infants less than one year of age is approximately equivalent. Secondly, in contrast to the extensive heterogeneity in cytogenetic abnormalities and chromosomal rearrangements in older children with ALL and AML, nearly 60% of acute leukemias in infants have chromosomal rerrangments involving the MLL gene (for Mixed Lineage Leukemia) on chromosome 11q23. MLL translocations characterize a subset of human acute leukemias with a decidedly unfavorable prognosis. Current estimates suggest that about 60% of infants with AML and about 80% of infants with ALL have a chromosomal rearrangment involving MLL abnormality in their leukemia cells. Whether hematopoietic cells in infants are more likely to undergo chromosomal rearrangements involving 11q13 or whether this 11q13 rearrangement reflects a unique environmental exposure or genetic susceptibliity remains to be determined.
The modern classification of acute leukemias in children and adults relies on morphologic and cytochemical features that may be useful in distinguishing AML from ALL, changes in the expression of cell surface antigens as a precursor cell differentiates, and the presence of specific recurrent cytogenetic or chromosomal rearrangements in leukemic cells. Using monoclonal antibodies, cell surface antigens (called clusters of differentiation (CD)) can be identified in cell populations; leukemias can be accurately classified by this means (immunophenotyping). By immunophenotyping, it is possible to classify ALL into the major categories of “common-CD10+B-cell precursor” (around 50%), “pre-B” (around 25%), “T” (around 15%), “null” (around 9%) and “B” cell ALL (around 1%). All forms other than T-ALL are considered to be derived from some stage of B-precursor cell, and “null” ALL is sometimes referred to as “early B-precursor” ALL.
Current risk classification schemes for ALL in children from 1-18 years of age use clinical and laboratory parameters such as patient age, initial white blood cell count, and the presence of specific ALL-associated cytogenetic abnormalities to stratify patients into “low,” “standard,” “high,” and “very high” risk categories. National Cancer Institute (NCI) risk criteria are first applied to all children with ALL, dividing them into “NCI standard risk” (age 1.00-9.99 years, WBC<50,000) and “NCI high risk” (age>10 years, WBC>50,000) based on age and initial white blood cell count (WBC) at disease presentation. In addition to these general NCI risk criteria, classic cytogenetic analysis and molecular genetic detection of frequently recurring cytogenetic abnormalities have been used to stratify ALL patients more precisely into “low,” “standard,” “high,” and “very high” risk categories.
These chromosomal aberrations primarily involve structural rearrangements (translocations) or numerical imbalances (hyperdiploidy—now assessed as specific chromosome trisomies, or hypodiploidy). Table 1 shows recurrent ALL genetic subtypes, their frequencies and their risk categorization.
The rate of disappearance of both B precursor and T ALL leukemic cells during induction chemotherapy (assessed morphologically or by other quantitative measures of residual disease) has also been used as an assessment of early therapeutic response and as a means of targeting children for therapeutic intensification (Gruhn et al., Leukemia 12:675-681, 1998; Foroni et al., Br. J. Haematol. 105:7-24, 1999; van Dongen et al., Lancet 352:1731-1738, 1998; Cavé et al., N. Engl. J. Med. 339:591-598, 1998; Coustan-Smith et al., Lancet 351:550-554, 1998; Chessells et al., Lancet 343:143-148, 1995; Nachman et al., N. Engl. J. Med. 338:1663-1671, 1998).
Children with “low risk” disease (22% of all B precursor ALL cases) are defined as having standard NCI risk criteria, the presence of low risk cytogenetic abnormalities (t(12;21)/TEL; AML1 or trisomies of chromosomes 4 and 10), and a rapid early clearance of bone marrow blasts during induction chemotherapy. Children with “standard risk” disease (50% of ALL cases) are NCI standard risk without “low risk” or unfavorable cytogenetic features, or, are children with low risk cytogenetic features who have NCI high risk criteria or slow clearance of blasts during induction. Although therapeutic intensification has yielded significant improvements in outcome in the low and standard risk groups of ALL, it is likely that a significant number of these children are currently “over-treated” and could be cured with less intensive regimens resulting in fewer toxicities and long term side effects. Conversely, a significant number of children even in these good risk categories still relapse and a precise means to prospectively identify them has remained elusive. Nearly 30% of children with ALL have “high” or “very high” risk disease, defined by NCI high risk criteria and the presence of specific cytogenetic abnormalities (such as t(1;19), t(9;22) or hypodiploidy) (Table 1); again, precise measures to distinguish children more prone to relapse in this heterogeneous group have not been established.
Despite these efforts, current diagnosis and risk classification schemes remain imprecise. Children with ALL more prone to relapse who require more intensive approaches and children with low risk disease who could be cured with less intensive therapies are not adequately predicted by current classification schemes and are distributed among all currently defined risk groups. Although pre-treatment clinical and tumor genetic stratification of patients has generally improved outcomes by optimizing therapy, variability in clinical course continues to exist among individuals within a single risk group and even among those with similar prognostic features. In fact, the most significant prognostic factors in childhood ALL explain no more than 4% of the variability in prognosis, suggesting that yet undiscovered molecular mechanisms dictate clinical behavior (Donadieu et al., Br J Haematol, 102:729-739, 1998). A precise means to prospectively identify such children has remained elusive.
The present invention is directed to methods for outcome prediction and risk classification in childhood leukemia. In one embodiment, the invention provides a method for classifying leukemia in a patient that includes obtaining a biological sample from a patient; determining the expression level for a selected gene product to yield an observed gene expression level; and comparing the observed gene expression level for the selected gene product to a control gene expression level. The control gene expression level can the expression level observed for the gene product in a control sample, or a predetermined expression level for the gene product. An observed expression level that differs from the control gene expression level is indicative of a disease classification. In another aspect, the method can include determining a gene expression profile for selected gene products in the biological sample to yield an observed gene expression profile; and comparing the observed gene expression profile for the selected gene products to a control gene expression profile for the selected gene products that correlates with a disease classification; wherein a similarity between the observed gene expression profile and the control gene expression profile is indicative of the disease classification.
The disease classification can be, for example, a classification based on predicted outcome (remission vs therapeutic failure); a classification based on karyotype; a classification based on leukemia subtype; or a classification based on disease etiology. Where the classification is based on disease outcome, the observed gene product is preferably a gene such as OPAL1, G1, G2, FYN binding protein, PBK1 or any of the genes listed in Table 42.
A novel gene, referred to herein as OPAL1, has been found to be strongly predictive of outcome in childhood leukemia, and presents new opportunities for better diagnosis, risk classification and better therapeutic options. Thus, in another embodiment, the invention includes a polynucleotide that encodes OPAL1 and variations thereof, the putative protein gene product of OPAL1 and variations thereof, and an antibody that binds to OPAL1, as well as host cells and vectors that include OPAL1.
The invention further provides for a method for predicting therapeutic outcome in a leukemia patient that includes obtaining a biological sample from a patient; determining the expression level for a selected gene product associated with outcome to yield an observed gene expression level; and comparing the observed gene expression level for the selected gene product to a control gene expression level for the selected gene product. The control gene expression level for the selected gene product can include the gene expression level for the selected gene product observed in a control sample, or a predetermined gene expression level for the selected gene product; wherein an observed expression level that is different from the control gene expression level for the selected gene product is indicative of predicted remission. Preferably, the selected gene product is OPAL1. Optionally, the method further comprises determining the expression level for another gene product, such as G1 or G2, and comparing in a similar fashion the observed gene expression level for the second gene product with a control gene expression level for that gene product, wherein an observed expression level for the second gene product that is different from the control gene expression level for that gene product is further indicative of predicted remission.
The invention further includes a method for detecting an OPAL1 polynucleotide in a biological sample which includes contacting the sample with an OPAL1 polynucleotide, or its complement, under conditions in which the polynucleotide selectively hybridizes to an OPAL1 gene; detecting hybridization of the polynucleotide to the OPAL1 gene in the sample. Likewise, the invention provides a method for detecting the OPAL1 protein in a biological sample that includes contacting the sample with an OPAL1 antibody under conditions in which the antibody selectively binds to an OPAL1 protein; and detecting the binding of the antibody to the OPAL1 protein in the sample. Pharmaceutical compositions including an therapeutic agent that includes an OPAL1 polynucleotide, polypeptide or antibody, together with a pharmaceutically acceptable carrier, are also included.
The invention further includes a method for treating leukemia comprising administering to a leukemia patient a therapeutic agent that modulates the amount or activity of the polypeptide associated with outcome. Preferably, the therapeutic agent increases the amount or activity of OPAL1.
Also provided by the invention is an in vitro method for screening a compound useful for treating leukemia. The invention further provides an in vivo method for evaluating a compound for use in treating leukemia. The candidate compounds are evaluated for their effect on the expression level(s) of one or more gene products associated with outcome in leukemia patients. Preferably, the gene product whose expression level is evaluated is the product of an OPAL1, G1, G2, FYN binding protein or PBK1 gene, or any of the genes listed in Table 42. More preferably, the gene product is a product of the OPAL1 gene.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawings will be provided by the Office upon request and payment of the necessary fee.
Gene expression profiling can provide insights into disease etiology and genetic progression, and can also provide tools for more comprehensive molecular diagnosis and therapeutic targeting. The biologic clusters and associated gene profiles identified herein are useful for refined molecular classification of acute leukemias as well as improved risk assessment and classification. In addition, the invention has identified numerous genes, including but not limited to the novel gene OPAL1 (also referred to herein as “G0”), G protein β2, related sequence 1 (also referred to herein as “G1”); IL-10 Receptor alpha (also referred to herein as “G2”), FYN-binding protein and PBK1, and the genes listed in Table 42 that are, alone or in combination, strongly predictive of outcome in pediatric ALL. The genes identified herein, and the proteins they encode, can be used to refine risk classification and diagnostics, to make outcome predictions and improve prognostics, and to serve as therapeutic targets in infant leukemia and pediatric ALL.
“Gene expression” as the term is used herein refers to the production of a biological product encoded by a nucleic acid sequence, such as a gene sequence. This biological product, referred to herein as a “gene product,” may be a nucleic acid or a polypeptide. The nucleic acid is typically an RNA molecule which is produced as a transcript from the gene sequence. The RNA molecule can be any type of RNA molecule, whether either before (e.g., precursor RNA) or after (e.g., mRNA) post-transcriptional processing. cDNA prepared from the mRNA of a sample is also considered a gene product. The polypeptide gene product is a peptide or protein that is encoded by the coding region of the gene, and is produced during the process of translation of the mRNA.
The term “gene expression level” refers to a measure of a gene product(s) of the gene and typically refers to the relative or absolute amount or activity of the gene product.
The term “gene expression profile” as used herein is defined as the expression level of two or more genes. Typically a gene expression profile includes expression levels for the products of multiple genes in given sample, up to 13,000 in the experiments described herein, preferably determined using an oligonucleotide microarray.
Unless otherwise specified, “a,” “an,” “the,” and “at least one” are used interchangeably and mean one or more than one.
Diagnosis, Prognosis and Risk Classification
Current parameters used for diagnosis, prognosis and risk classification in pediatric ALL are related to clinical data, cytogenetics and response to treatment. They include age and white blood count, cytogenetics, the presence or absence of minimal residual disease (MRD), and a morphological assessment of early response (measured as slow or rapid early therapeutic response). As noted above however, these parameters are not always well correlated with outcome, nor are they precisely predictive at diagnosis.
The present invention provides an improved method for identifying and/or classifying acute leukemias. Expression levels are determined for one or more genes associated with outcome, risk assessment or classification, karyotpe (e.g., MLL translocation) or subtype (e.g., ALL vs. AML; pre-B ALL vs. T-ALL. Genes that are particularly relevant for diagnosis, prognosis and risk classification according to the invention include those described in the tables and figures herein. The gene expression levels for the gene(s) of interest in a biological sample from a patient diagnosed with or suspected of having an acute leukemia are compared to gene expression levels observed for a control sample, or with a predetermined gene expression level. Observed expression levels that are higher or lower than the expression levels observed for the gene(s) of interest in the control sample or that are higher or lower than the predetermined expression levels for the gene(s) of interest provide information about the acute leukemia that facilitates diagnosis, prognosis, and/or risk classification and can aid in treatment decisions. When the expression levels of multiple genes are assessed for a single biological sample, a gene expression profile is produced.
In one aspect, the invention provides genes and gene expression profiles that are correlated with outcome (i.e., complete continuous remission vs. therapeutic failure) in infant leukemia and/or in pediatric ALL. Assessment of one or more of these genes according to the invention can be integrated into revised risk classification schemes, therapeutic targeting and clinical trial design. In one embodiment, the expression levels of a particular gene are measured, and that measurement is used, either alone or with other parameters, to assign the patient to a particular risk category. The invention identifies several genes whose expression levels, either alone or in combination, are associated with outcome, including but not limited to OPAL1/G0, G1, G2, PBK1 (Affymetrix accession no. 39418_at, DKFZP564M182 protein; GenBank No. AJ007398); FYN-binding protein (Affymetrix accession no. 41819_at, FYB-120/130; GenBank No. AF001862; da Silva, Proc. Nat'l. Acad. Sci. USA 94(14):7493-7498 (1997)); and the genes listed in Table 42. Some of these genes (e.g., OPAL1/G0) exhibit a positive association between expression level and outcome. For these genes, expression levels above a predetermined threshold level (or higher than that exhibited by a control sample) is predictive of a positive outcome. Our data suggests that direct measurement of the expression level of OPAL1/G0, optionally in conjunction with G1 and/or G2, can be used in refining risk classification and outcome prediction in pediatric ALL. In particular, it is expected such measurements can be used to refine risk classification in children who are otherwise classified as having low risk ALL, as well as to precisely identify children with high risk ALL who could be cured with less intensive therapies.
OPAL1/G0, in particular, is a very strong predictor for outcome. Our data suggest that OPAL1/G0 (alone and/or together with G1 and/or G2) may prove to be the dominant predictor for outcome in infant leukemia or pediatric ALL, more powerful than the current risk stratification standards of age and white blood count. OPAL1/G0 tends to be expressed at lower frequencies and lower overall levels in ALL cases with cytogenetic abnormalities associated with a poorer prognosis (such as t(9;22) and t(4;11)). Indeed, regardless of risk classification, cytogenetics or biological group, roughly the same outcome statistics are seen based upon the expression level of OPAL1/G0.
We found that higher OPAL1 expression distinguished ALL cases with good (OPAL1 high: 87% long term remission) versus poor outcome (OPAL1 low: 32% long term remission) in a statistically designed, retrospective pediatric ALL case control study (detailed below). Low OPAL1 was associated with induction failure (p=0.0036) while high OPAL1 was associated with long term event free survival (p=0.02), particularly in males (p=0.0004). OPAL1 was more frequently expressed at higher levels in cases with t(12;21), normal karyotype, and hyperdiploidy (better prognosis karyotypes) compared to t(1;19) or t(9;22) (poorer prognosis karyotypes). 86% of ALL cases with t(12;21) and high OPAL1 achieved long term remission in contrast to only 35% of t(12;21) cases with low OPAL1, suggesting that OPAL1 may be useful in prospectively identifying children who might benefit from further intensification. In ALL cases classified as high risk by the NCI criteria, 87% of those that exhibited high OPAL1 levels actually achieved long term remission, compared an overall long term remission outcome of 44% in this cohort. OPAL1 was also highly predictive of a favorable outcome in T ALL (p=0.02) and a similar trend was observed in a distinct infant ALL data set (see below). Thus, high OPAL1 levels are expected to be associated with long term remissions on standard, less intensive therapies, and conversely low OPAL1 levels, even in otherwise low risk ALL patients defined by current risk classification schemes, can identify children who require therapeutic intensification for cure.
For genes such as PBK1 whose expression levels are inversely correlated with outcome, observed expression levels above a predetermined threshold level (or higher than those observed in a control sample) are useful for classifying a patient into a higher risk category due to the predicted unfavorable outcome. Expression levels for multiple genes can be measured. For example, if normalized expression levels for OPAL1/G0, G1 and G2 are all high, a favorable outcome can be predicted with greater certainty.
The expression levels of multiple (two or more) genes in one or more lists of genes associated with outcome can be measured, and those measurements are used, either alone or with other parameters, to assign the patient to a particular risk category. For example, gene expression levels of multiple genes can be measured for a patient (as by evaluating gene expression using an Affymetrix microarray chip) and compared to a list of genes whose expression levels (high or low) are associated with a positive (or negative) outcome. If the gene expression profile of the patient is similar to that of the list of genes associated with outcome, then the patient can be assigned to a low (or high, as the case may be) risk category. The correlation between gene expression profiles and class distinction can be determined using a variety of methods. Methods of defining classes and classifying samples are described, for example, in Golub et al, U.S. Patent Application Publication No. 2003/0017481 published Jan. 23, 2003, and Golub et al., U.S. Patent Application Publication No. 2003/0134300, published Jul. 17, 2003. The information provided by the present invention, alone or in conjunction with other test results, aids in sample classification and diagnosis of disease.
Computational analysis using the gene lists and other data, such as measures of statistical significance, as described herein is readily performed on a computer. The invention should therefore be understood to encompass machine readable media comprising any of the data, including gene lists, described herein. The invention further includes an apparatus that includes a computer comprising such data and an output device such as a monitor or printer for evaluating the results of computational analysis performed using such data.
In another aspect, the invention provides genes and gene expression profiles that are correlated with cytogenetics. This allows discrimination among the various karyotypes, such as MLL translocations or numerical imbalances such as hyperdiploidy or hypodiploidy, which are useful in risk assessment and outcome prediction.
In yet another aspect, the invention provides genes and gene expression profiles that are correlated with intrinsic disease biology and/or etiology. In other words, gene expression profiles that are common or shared among individual leukemia cases in different patents can be used to define intrinsically related groups (often referred to as clusters) of acute leukemia that cannot be appreciated or diagnosed using standard means such as morphology, immunophenotype, or cytogenetics. Mathematical modeling of the very sharp peak in ALL incidence seen in children 2-3 years old (>80 cases per million) has suggested that ALL may arise from two primary events, the first of which occurs in utero and the second after birth (Linet et al., Descriptive epidemiology of the leukemias, in Leukemias, 5th Edition. E S Henderson et al. (eds). W B Saunders, Philadelphia. 1990). Interestingly, the detection of certain ALL-associated genetic abnormalities in cord blood samples taken at birth from children who are ultimately affected by disease supports this hypothesis (Gale et al., Proc. Natl. Acad. Sci. U.S.A., 94:13950-13954, 1997; Ford et al., Proc. Natl. Acad. Sci. U.S.A., 95:4584-4588, 1998).
Our results for both infant leukemia and pediatric ALL suggest that this disease is composed of novel intrinsic biologic clusters defined by shared gene expression profiles, and that these intrinsic subsets cannot be defined or predicted by traditional labels currently used for risk classification or by the presence or absence of specific cytogenetic abnormalities. We have identified 9 novel groups for pediatric ALL and 3 novel groups for infant leukemia using unsupervised learning methods for class discovery, and have used supervised learning methods for class prediction and outcome correlations that have identified candidate genes associated with classification and outcome. The gene expression profiles in the infant leukemia clusters provide some clues to novel and independent etiologies.
Some genes in these clusters are metabolically related, suggesting that a metabolic pathway that is associated with cancer initiation or progression. Other genes in these metabolic pathways, like the genes described herein but upstream or downstream from them in the metabolic pathway, thus can also serve as therapeutic targets.
In yet another aspect, the invention provides genes and gene expression profiles that discriminate acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL) in infant leukemias by measuring the expression levels of a gene product correlated with ALL or AML.
Another aspect of the invention provides genes and gene expression profiles that discriminate pre-B lineage ALL from T ALL in pediatric leukemias by measuring expression levels of a gene product correlated with pre-B lineage ALL or T ALL.
It should be appreciated that while the present invention is described primarily in terms of human disease, it is useful for diagnostic and prognostic applications in other mammals as well, particularly in veterinary applications such as those related to the treatment of acute leukemia in cats, dogs, cows, pigs, horses and rabbits.
Further, the invention provides methods for computational and statistical methods for identifying genes, lists of genes and gene expression profiles associated with outcome, karyotype, disease subtype and the like as described herein.
Measurement of Gene Expression Levels
Gene expression levels are determined by measuring the amount or activity of a desired gene product (i.e., an RNA or a polypeptide encoded by the coding sequence of the gene) in a biological sample. Any biological sample can be analyzed. Preferably the biological sample is a bodily tissue or fluid, more preferably it is a bodily fluid such as blood, serum, plasma, urine, bone marrow, lymphatic fluid, and CNS or spinal fluid. Preferably, samples containing mononuclear bloods cells and/or bone marrow fluids and tissues are used. In embodiments of the method of the invention practiced in cell culture (such as methods for screening compounds to identify therapeutic agents), the biological sample can be whole or lysed cells from the cell culture or the cell supernatant.
Gene expression levels can be assayed qualitatively or quantitatively. The level of a gene product is measured or estimated in a sample either directly (e.g., by determining or estimating absolute level of the gene product) or relatively (e.g., by comparing the observed expression level to a gene expression level of another samples or set of samples). Measurements of gene expression levels may, but need not, include a normalization process.
Typically, mRNA levels (or cDNA prepared from such mRNA) are assayed to determine gene expression levels. Methods to detect gene expression levels include Northern blot analysis (e.g., Harada et al., Cell 63:303-312 (1990)), S1 nuclease mapping (e.g., Fujita et al., Cell 49:357-367 (1987)), polymerase chain reaction (PCR), reverse transcription in combination with the polymerase chain reaction (RT-PCR) (e.g., Example III; see also Makino et al., Technique 2:295-301 (1990)), and reverse transcription in combination with the ligase chain reaction (RT-LCR). Multiplexed methods that allow the measurement of expression levels for many genes simultaneously are preferred, particularly in embodiments involving methods based on gene expression profiles comprising multiple genes. In a preferred embodiment, gene expression is measured using an oligonucleotide microarray, such as a DNA microchip, as described in the examples below. DNA microchips contain oligonucleotide probes affixed to a solid substrate, and are useful for screening a large number of samples for gene expression.
Alternatively or in addition, polypeptide levels can be assayed. Immunological techniques that involve antibody binding, such as enzyme linked immunosorbent assay (ELISA) and radioimmunoassay (RIA), are typically employed. Where activity assays are available, the activity of a polypeptide of interest can be assayed directly.
The observed expression levels for the gene(s) of interest are evaluated to determine whether they provide diagnostic or prognostic information for the leukemia being analyzed. The evaluation typically involves a comparison between observed gene expression levels and either a predetermined gene expression level or threshold value, or a gene expression level that characterizes a control sample. The control sample can be a sample obtained from a normal (i.e., non-leukemic patient) or it can be a sample obtained from a patient with a known leukemia. For example, if a cytogenic classification is desired, the biological sample can be interrogated for the expression level of a gene correlated with the cytogenic abnormality, then compared with the expression level of the same gene in a patient known to have the cytogenetic abnormality (or an average expression level for the gene that characterizes that population).
Treatment of Infant Leukemia and Pediatric ALL
The genes identified herein that are associated with outcome and/or specific disease subtypes or karyotypes are likely to have a specific role in the disease condition, and hence represent novel therapeutic targets. Thus, another aspect of the invention involves treating infant leukemia and pediatric ALL patients by modulating the expression of one or more genes described herein.
In the case of OPAL1/G0, whose increased expression above threshold values is associated with a positive outcome, the treatment method of the invention involves enhancing OPAL1/G0 expression. For a number of the gene products identified herein increased expression is correlated with positive outcomes in leukemia patients. Thus, the invention includes a method for treating leukemia, such as infant leukemia and/or pediatric ALL, that involves administering to a patient a therapeutic agent that causes an increase in the amount or activity of OPAL1/G0 and/or other polypeptides of interest that have been identified herein to be positively correlated with outcome. Preferably the increase in amount or activity of the selected gene product is at least 10%, preferably 25%, most preferably 100% above the expression level observed in the patient prior to treatment.
The therapeutic agent can be a polypeptide having the biological activity of the polypeptide of interest (e.g., an OPAL1/G0 polypeptide) or a biologically active subunit or analog thereof. Alternatively, the therapeutic agent can be a ligand (e.g., a small non-peptide molecule, a peptide, a peptidomimetic compound, an antibody, or the like) that agonizes (i.e., increases) the activity of the polypeptide of interest. For example, in the case of OPAL1/G0, which is postulated to be a membrane-bound protein that may function as a receptor or signaling molecule, the invention encompasses the use of a proline-rich ligand of the WW-binding protein 1 to agonize OPAL1/G0 activity.
Gene therapies can also be used to increase the amount of a polypeptide of interest, such as OPAL1/G0 in a host cell of a patient. Polynucleotides operably encoding the polypeptide of interest can be delivered to a patient either as “naked DNA” or as part of an expression vector. The term vector includes, but is not limited to, plasmid vectors, cosmid vectors, artificial chromosome vectors, or, in some aspects of the invention, viral vectors. Examples of viral vectors include adenovirus, herpes simplex virus (HSV), alphavirus, simian virus 40, picornavirus, vaccinia virus, retrovirus, lentivirus, and adeno-associated virus. Preferably the vector is a plasmid. In some aspects of the invention, a vector is capable of replication in the cell to which it is introduced; in other aspects the vector is not capable of replication. In some preferred aspects of the present invention, the vector is unable to mediate the integration of the vector sequences into the genomic DNA of a cell. An example of a vector that can mediate the integration of the vector sequences into the genomic DNA of a cell is a retroviral vector, in which the integrase mediates integration of the retroviral vector sequences. A vector may also contain transposon sequences that facilitate integration of the coding region into the genomic DNA of a host cell.
Selection of a vector depends upon a variety of desired characteristics in the resulting construct, such as a selection marker, vector replication rate, and the like. An expression vector optionally includes expression control sequences operably linked to the coding sequence such that the coding region is expressed in the cell. The invention is not limited by the use of any particular promoter, and a wide variety is known. Promoters act as regulatory signals that bind RNA polymerase in a cell to initiate transcription of a downstream (3′ direction) operably linked coding sequence. The promoter used in the invention can be a constitutive or an inducible promoter. It can be, but need not be, heterologous with respect to the cell to which it is introduced.
Another option for increasing the expression of a gene like OPAL1/G0 wherein higher expression levels are predictive for outcome is to reduce the amount of methylation of the gene. Demethylation agents, therefore, can be used to re-activate expression of OPAL/G0 in cases where methylation of the gene is responsible for reduced gene expression in the patient.
For other genes identified herein as being correlated without outcome in infant leukemia or pediatric ALL, high expression of the gene is associated with a negative outcome rather than a positive outcome. An example of this type of gene is PBK1. These genes (and their associated gene products) accordingly represent novel therapeutic targets, and the invention provides a therapeutic method for reducing the amount and/or activity of these polypeptides of interest in a leukemia patient. Preferably the amount or activity of the selected gene product is reduced to at least 90%, more preferably at least 75%, most preferably at least 25% of the gene expression level observed in the patient prior to treatment A cell manufactures proteins by first transcribing the DNA of a gene for that protein to produce RNA (transcription). In eukaryotes, this transcript is an unprocessed RNA called precursor RNA that is subsequently processed (e.g. by the removal of introns, splicing, and the like) into messenger RNA (mRNA) and finally translated by ribosomes into the desired protein. This process may be interfered with or inhibited at any point, for example, during transcription, during RNA processing, or during translation. Reduced expression of the gene(s) leads to a decrease or reduction in the activity of the gene product.
The therapeutic method for inhibiting the activity of a gene whose expression is correlated with negative outcome involves the administration of a therapeutic agent to the patient. The therapeutic agent can be a nucleic acid, such as an antisense RNA or DNA, or a catalytic nucleic acid such as a ribozyme, that reduces activity of the gene product of interest by directly binding to a portion of the gene encoding the enzyme (for example, at the coding region, at a regulatory element, or the like) or an RNA transcript of the gene (for example, a precursor RNA or mRNA, at the coding region or at 5′ or 3′ untranslated regions) (see, e.g., Golub et al., U.S. Patent Application Publication No. 2003/0134300, published Jul. 17, 2003). Alternatively, the nucleic acid therapeutic agent can encode a transcript that binds to an endogenous RNA or DNA; or encode an inhibitor of the activity of the polypeptide of interest. It is sufficient that the introduction of the nucleic acid into the cell of the patient is or can be accompanied by a reduction in the amount and/or the activity of the polypeptide of interest. An RNA aptamer can also be used to inhibit gene expression. The therapeutic agent may also be protein inhibitor or antagonist, such as small non-peptide molecule such as a drug or a prodrug, a peptide, a peptidomimetic compound, an antibody, a protein or fusion protein, or the like that acts directly on the polypeptide of interest to reduce its activity.
The invention includes a pharmaceutical composition that includes an effective amount of a therapeutic agent as described herein as well as a pharmaceutically acceptable carrier. Therapeutic agents can be administered in any convenient manner including parenteral, subcutaneous, intravenous, intramuscular, intraperitoneal, intranasal, inhalation, transdermal, oral or buccal routes. The dosage administered will be dependent upon the nature of the agent; the age, health, and weight of the recipient; the kind of concurrent treatment, if any; frequency of treatment; and the effect desired. A therapeutic agent identified herein can be administered in combination with any other therapeutic agent(s) such as immunosuppressives, cytotoxic factors and/or cytokine to augment therapy, see Golub et al, Golub et al., U.S. Patent Application Publication No. 2003/0134300, published Jul. 17, 2003, for examples of suitable pharmaceutical formulations and methods, suitable dosages, treatment combinations and representative delivery vehicles.
The effect of a treatment regimen on an acute leukemia patient can be assessed by evaluating, before, during and/or after the treatment, the expression level of one or more genes as described herein. Preferably, the expression level of gene(s) associated with outcome, such as OPAL1/G0, G1 and/or G2 are monitored over the course of the treatment period. Optionally gene expression profiles showing the expression levels of multiple selected genes associated with outcome can be produced at different times during the course of treatment and compared to each other and/or to an expression profile correlated with outcome.
Screening for Therapeutic Agents
The invention further provides methods for screening to identify agents that modulate expression levels of the genes identified herein that are correlated with outcome, risk assessment or classification, cytogenetics or the like. Candidate compounds can be identified by screening chemical libraries according to methods well known to the art of drug discovery and development (see Golub et al., U.S. Patent Application Publication No. 2003/0134300, published Jul. 17, 2003, for a detailed description of a wide variety of screening methods). The screening method of the invention is preferably carried out in cell culture, for example using leukemic cell lines that express known levels of the therapeutic target, such as OPAL1/G0. The cells are contacted with the candidate compound and changes in gene expression of one or more genes relative to a control culture are measured. Alternatively, gene expression levels before and after contact with the candidate compound can be measured. Changes in gene expression indicate that the compound may have therapeutic utility. Structural libraries can be surveyed computationally after identification of a lead drug to achieve rational drug design of even more effective compounds.
The invention further relates to compounds thus identified according to the screening methods of the invention. Such compounds can be used to treat infant leukemia and/or pediatric ALL, as appropriate, and can be formulated for therapeutic use as described above.
OPAL1 Polynucleotide, Polypeptide and Antibody
The invention includes novel nucleotide sequences found to be strongly associated with outcome in pediatric ALL, as well as the novel polypeptides they encode. These sequences, which we originally called “G0” but now have named OPAL1 for Outcome Predictor in Acute Leukemia, appear to be associated with alternatively spliced products of a large and complex gene. Alternate 5′ exon usage likely causes the production of more than one distinct protein from the genomic sequence. We have now fully cloned both the genomic and cDNA sequences (SEQ ID NO:16) of OPAL1. Expression levels of OPAL1/G0 that are high in relation to a predetermined threshold or a control sample are indicative of good prognosis.
Nucleotide sequences (SEQ ID NOs:1 and 3) encoding two alternatively spliced forms of the polypeptide gene product, OPAL1/G0, are shown in
The present invention also includes polypeptides with an amino acid sequence having at least about 80% amino acid identity, at least about 90% amino acid identity, or about 95% amino acid identity with SEQ ID NO:2 or 4. Amino acid identity is defined in the context of a comparison between an amino acid sequence and SEQ ID NO:2 or 4, and is determined by aligning the residues of the two amino acid sequences (i.e., a candidate amino acid sequence and the amino acid sequence of SEQ ID NO:2 or 4) to optimize the number of identical amino acids along the lengths of their sequences; gaps in either or both sequences are permitted in making the alignment in order to optimize the number of identical amino acids, although the amino acids in each sequence must nonetheless remain in their proper order. A candidate amino acid sequence is the amino acid sequence being compared to an amino acid sequence present in SEQ ID NO:2 or 4. A candidate amino acid sequence can be isolated from a natural source, or can be produced using recombinant techniques, or chemically or enzymatically synthesized. Preferably, two amino acid sequences are compared using the Blastp program of the BLAST 2 search algorithm, as described by Tatusova et al. (FEMS Microbiol. Lett., 174:247-250, 1999, and available on the world wide web at ncbi.nlm.nih.gov/gorf/b12.html). Preferably, the default values for all BLAST 2 search parameters are used, including matrix=BLOSUM62; open gap penalty=11, extension gap penalty=1, gap×dropoff=50, expect=10, wordsize=3, and optionally, filter on. In the comparison of two amino acid sequences using the BLAST2 search algorithm, amino acid identity is referred to as “identities.” A polypeptide of the present invention that has at least about 80% identity with SEQ ID NO:2 or 4 also has the biological activity of OPAL1/G0.
The polypeptides of this aspect of the invention also include an active analog of SEQ ID NO:2 or 4. Active analogs of SEQ ID NO:2 or 4 include polypeptides having amino acid substitutions that do not eliminate the ability to perform the same biological function(s) as OPAL1/G0. Substitutes for an amino acid may be selected from other members of the class to which the amino acid belongs. For example, nonpolar (hydrophobic) amino acids include alanine, leucine, isoleucine, valine, proline, phenylalanine, tryptophan, and tyrosine. Polar neutral amino acids include glycine, serine, threonine, cysteine, tyrosine, aspartate, and glutamate. The positively charged (basic) amino acids include arginine, lysine, and histidine. The negatively charged (acidic) amino acids include aspartic acid and glutamic acid. Such substitutions are known to the art as conservative substitutions. Specific examples of conservative substitutions include Lys for Arg and vice versa to maintain a positive charge; Glu for Asp and vice versa to maintain a negative charge; Ser for Thr so that a free —OH is maintained; and Gln for Asn to maintain a free NH2.
Active analogs, as that term is used herein, include modified polypeptides. Modifications of polypeptides of the invention include chemical and/or enzymatic derivatizations at one or more constituent amino acids, including side chain modifications, backbone modifications, and N- and C-terminal modifications including acetylation, hydroxylation, methylation, amidation, and the attachment of carbohydrate or lipid moieties, cofactors, and the like.
The present invention further includes polynucleotides encoding the amino acid sequence of SEQ ID NO:2 or 4. An example of the class of nucleotide sequences encoding the polypeptide having SEQ ID NO:2 is SEQ ID NO:1; and an example of the class of nucleotide sequences encoding the polypeptide having SEQ ID NO:4 is SEQ ID NO:3. The other nucleotide sequences encoding the polypeptides having SEQ ID NO:2 or 4 can be easily determined by taking advantage of the degeneracy of the three letter codons used to specify a particular amino acid. The degeneracy of the genetic code is well known to the art and is therefore considered to be part of this disclosure. The classes of nucleotide sequences that encode SEQ ID NO:2 and 4 are large but finite, and the nucleotide sequence of each member of the classes can be readily determined by one skilled in the art by reference to the standard genetic code.
The present invention also includes polynucleotides with a nucleotide sequence having at least about 90% nucleotide identity, at least about 95% nucleotide identity, or about 98% nucleotide identity with SEQ ID NO:1 or 3. Nucleotide identity is defined in the context of a comparison between an nucleotide sequence and SEQ ID NO:1 or 3, and is determined by aligning the residues of the two nucleotide sequences (i.e., a candidate nucleotide sequence and the nucleotide sequence of SEQ ID NO:1 or 3) to optimize the number of identical nucleotides along the lengths of their sequences; gaps in either or both sequences are permitted in making the alignment in order to optimize the number of identical nucleotides, although the nucleotides in each sequence must nonetheless remain in their proper order. A candidate nucleotide sequence is the nucleotide sequence being compared to an nucleotide sequence present in SEQ ID NO:2 or 4. A candidate nucleotide sequence can be isolated from a natural source, or can be produced using recombinant techniques, or chemically or enzymatically synthesized. Percent identity is determined by aligning two polynucleotides to optimize the number of identical nucleotides along the lengths of their sequences; gaps in either or both sequences are permitted in making the alignment in order to optimize the number of shared nucleotides, although the nucleotides in each sequence must nonetheless remain in their proper order. For example, the two nucleotide sequences are readily compared using the Blastn program of the BLAST 2 search algorithm, as described by Tatusova et al. (FEMS Microbiol. Lett., 174:247-250, 1999). Preferably, the default values for all BLAST 2 search parameters are used, including reward for match=1, penalty for mismatch=−2, open gap penalty=5, extension gap penalty=2, gap x_dropoff=50, expect=10, wordsize=11, and filter on.
Examples of polynucleotides encoding a polypeptide of the present invention also include those having a complement that hybridizes to the nucleotide sequence SEQ ID NO:1 or 3 under defined conditions. The term “complement” refers to the ability of two single stranded polynucleotides to base pair with each other, where an adenine on one polynucleotide will base pair to a thymine on a second polynucleotide and a cytosine on one polynucleotide will base pair to a guanine on a second polynucleotide. Two polynucleotides are complementary to each other when a nucleotide sequence in one polynucleotide can base pair with a nucleotide sequence in a second polynucleotide. For instance, 5′-ATGC and 5′-GCAT are complementary. As used herein, “hybridizes,” “hybridizing,” and “hybridization” means that a single stranded polynucleotide forms a noncovalent interaction with a complementary polynucleotide under certain conditions. Typically, one of the polynucleotides is immobilized on a membrane. Hybridization is carried out under conditions of stringency that regulate the degree of similarity required for a detectable probe to bind its target nucleic acid sequence. Preferably, at least about 20 nucleotides of the complement hybridize with SEQ ID NO:1 or 3, more preferably at least about 50 nucleotides, most preferably at least about 100 nucleotides.
Also provided by the invention is an OPAL1/G0 antibody, or antigen-binding portion thereof, that binds the novel protein OPAL1/G0. OPAL1/G0 antibodies can be used to detect OPAL1/G0 protein; they are also useful therapeutically to modulate expression of the OPAL1/G0 gene. An antibody may be polyclonal or monoclonal. Methods for making polyclonal and monoclonal antibodies are well known to the art. Monoclonal antibodies can be prepared, for example, using hybridoma techniques, recombinant, and phage display technologies, or a combination thereof. See Golub et al., U.S. Patent Application Publication No. 2003/0134300, published Jul. 17, 2003, for a detailed description of the preparation and use of antibodies as diagnostics and therapeutics.
Preferably the antibody is a human or humanized antibody, especially if it is to be used for therapeutic purposes. A human antibody is an antibody having the amino acid sequence of a human immunoglobulin and include antibodies produced by human B cells, or isolated from human sera, human immunoglobulin libraries or from animals transgenic for one or more human immunoglobulins and that do not express endogenous immunoglobulins, as described in U.S. Pat. No. 5,939,598 by Kucherlapati et al., for example. Transgenic animals (e.g., mice) that are capable, upon immunization, of producing a full repertoire of human antibodies in the absence of endogenous immunoglobulin production can be employed. For example, it has been described that the homozygous deletion of the antibody heavy chain joining region (J(H)) gene in chimeric and germ-line mutant mice results in complete inhibition of endogenous antibody production. Transfer of the human germ-line immunoglobulin gene array in such germ-line mutant mice will result in the production of human antibodies upon antigen challenge (see, e.g., Jakobovits et al., Proc. Natl. Acad. Sci. U.S.A., 90:2551-2555 (1993); Jakobovits et al., Nature, 362:255-258 (1993); Bruggemann et al., Year in Immuno., 7:33 (1993)). Human antibodies can also be produced in phage display libraries (Hoogenboom et al., J. Mol. Biol., 227:381 (1991); Marks et al., J. Mol. Biol., 222:581 (1991)). The techniques of Cote et al. and Boerner et al. are also available for the preparation of human monoclonal antibodies (Cole et al., Monoclonal Antibodies and Cancer Therapy, Alan R. Liss, p. 77 (1985); Boerner et al., J. Immunol., 147(1):86-95 (1991)).
Antibodies generated in non-human species can be “humanized” for administration in humans in order to reduce their antigenicity. Humanized forms of non-human (e.g., murine) antibodies are chimeric immunoglobulins, immunoglobulin chains or fragments thereof (such as Fv, Fab, Fab′, F(ab′)2, or other antigen-binding subsequences of antibodies) which contain minimal sequence derived from non-human immunoglobulin. Residues from a complementary determining region (CDR) of a human recipient antibody are replaced by residues from a CDR of a non-human species (donor antibody) such as mouse, rat or rabbit having the desired specificity. Optionally, Fv framework residues of the human immunoglobulin are replaced by corresponding non-human residues. See Jones et al., Nature, 321:522-525 (1986); Riechmann et al., Nature, 332:323-327 (1988); and Presta, Curr. Op. Struct. Biol., 2:593-596 (1992). Methods for humanizing non-human antibodies are well known in the art. See Jones et al., Nature, 321:522-525 (1986); Riechmann et al., Nature, 332:323-327 (1988); Verhoeyen et al., Science, 239:1534-1536 (1988); and (U.S. Pat. No. 4,816,567).
Laboratory Applications
The present invention further includes a microchip for use in clinical settings for detecting gene expression levels of one or more genes described herein as being associated with outcome, risk classification, cytogenics or subtype in infant leukemia and pediatric ALL. In a preferred embodiment, the microchip contains DNA probes specific for the target gene(s). Also provided by the invention is a kit that includes means for measuring expression levels for the polypeptide product(s) of one or more such genes, preferably OPAL/G0, G1, G2, FYN binding protein, PBK1, or any of the genes listed in Table 42. In a preferred embodiment, the kit is an immunoreagent kit and contains one or more antibodies specific for the polypeptide(s) of interest.
The present invention is illustrated by the following examples. It is to be understood that the particular examples, materials, amounts, and procedures are to be interpreted broadly in accordance with the scope and spirit of the invention as set forth herein
Leukemia Blast Purification, RNA Isolation, Amplification and Hybridization to Oligonucleotide Arrays
Laboratory techniques were developed to optimize sample handling and processing for high quality microarray studies for gene expression profiling in leukemia samples. Reproducible methods were developed for leukemia blast purification, RNA isolation, linear amplification, and hybridization to oligonucleotide arrays. Our optimized approach is a modification of a double amplification method originally developed by Ihor Lemischka and colleagues from Princeton University (Ivanova et al., Science 298(5593):601-604 (2002)).
Total RNA was isolated from leukemic blasts using Qiagen Rneasy. An average of 2×107 cells were used for total RNA extraction with the Qiagen RNeasy mini kit (Valencia, Calif.). The yield and integrity of the purified total RNA were assessed with the RiboGreen assay (Molecular Probes, Eugene, Oreg.) and the RNA 6000 Nano Chip (Agilent Technologies, Palo Alto, Calif.), respectively.
Complementary RNA (cRNA) target was prepared from 2.5 μg total RNA using two rounds of Reverse Transcription (RT) and In Vitro Transcription (IVT). Following denaturation for 5 minutes at 70° C., the total RNA was mixed with 100 pmol T7-(dT)24 oligonucleotide primer (Genset Oligos, La Jolla, Calif.) and allowed to anneal at 42° C. The mRNA was reverse transcribed with 200 units Superscript II (Invitrogen, Grand Island, N.Y.) for 1 hour at 42° C. After RT, 0.2 volume 5× second strand buffer, additional dNTP, 40 units DNA polymerase I, 10 units DNA ligase, 2 units RnaseH (Invitrogen) were added and second strand cDNA synthesis was performed for 2 hours at 16° C. After T4 DNA polymerase (10 units), the mix was incubated an additional 10 minutes at 16° C. An equal volume of phenol:chloroform:isoamyl alcohol (25:24:1)(Sigma, St. Louis, Mo.) was used for enzyme removal. The aqueous phase was transferred to a microconcentrator (Microcon 50. Millipore, Bedford, Mass.) and washed/concentrated with 0.5 ml DEPC water twice the sample was concentrated to 10-20 ul. The cDNA was then transcribed with T7 RNA polymerase (Megascript, Ambion, Austin, Tex.) for 4 hr at 37° C. Following IVT, the sample was phenol:chloroform:isoamyl alcohol extracted, washed and concentrated to 10-20 ul.
The first round product was used for a second round of amplification which utilized random hexamer and T7-(dT)24 oligonucleotide primers, Superscript II, two RNase H additions, DNA polymerase I plus T4 DNA polymerase finally and a biotin-labeling high yield T7 RNA polymerase kit (Enzo Diagnostics, Farmingdale, N.Y.). The biotin-labeled cRNA was purified on Qiagen RNeasy mini kit columns, eluted with 50 ul of 45° C. RNase-free water and quantified using the RiboGreen assay.
Following RNA isolation and cRNA amplification using two rounds of poly dT primer-anchored Reverse Transcription and T7 RNA polymerase transcription, RNA and cRNA quality was assessed by capillary electrophoresis on Agilent RNA Lab-Chips. After the quality check on Agilent Nano 900 Chips, 15 ug cRNA were fragmented following the Affymetrix protocol (Affymetrix, Santa Clara, Calif.). The fragmented RNA was then hybridized for 20 hours at 45° C. to HG_U95Av2 probes. The hybridized probe arrays were washed and stained with the EukGE_WS2 fluidics protocol (Affymetrix), including streptavidin phycoerythrin conjugate (SAPE, Molecular Probes, Eugene, Oreg.) and an antibody amplification step (Anti-streptavidin, biotinylated, Vector Labs, Burlingame, Calif.). HG_U95Av2 chips were scanned at 488 nm, as recommended by Affymetrix. The expression value of each gene was calculated using Affymetrix Microarray Suite 5.0 software.
We routinely obtain 100-200 micrograms of amplified cRNA from 2.5 micrograms of leukemia cell-derived total RNA. Our detailed statistical analysis comparing various RNA inputs and single vs. double amplification methods have shown that this approach leads to an excellent representation of low as well as high abundance mRNAs and is highly reproducible. It has the added benefit of not losing the representation of low abundance genes frequently lost in methods that lack amplification or only perform single round amplifications. As only 15 micrograms of cRNA are required per Affymetrix chip, we are able to store residual cRNA in virtually all cases; this highly valuable cRNA can be used again in the future as array platforms and methods of analysis improve. Samples were studied using oligonucleotide microarrays containing 12,625 probes (Affymetrix U95Av2 array platform).
Statistical Design
We designed two retrospective cohorts of pediatric ALL patients registered to clinical trials previously coordinated by the Pediatric Oncology Group (POG): 1) a cohort 127 infant leukemias (the “infant” data set); and 2) a case control study of 254 pediatric B-precursor and T cell ALL cases (the “preB” dataset). These samples were obtained from patients with long term follow up who were registered to clinical trials completed by the Pediatric Oncology Group (POG). In the analysis of gene expression profiles for classification and particularly outcome prediction, it is essential to integrate gene expression data with laboratory parameters that impact the quality of the primary data, and to make sure that any derived cluster or gene list cannot be accounted for by variations in laboratory methodology. Thus we tracked and annotated our gene expression data set with all of the laboratory correlates shown below.
Laboratory Correlates
For the retropective “infant” study, 142 retrospective cases from two POG infant trials (9407 for infant ALL; 9421 for infant AML) were initially chosen for analysis. Infants as defined were <365 days in age and had overall extremely poor survival rates (<25%). Of the 142 cases, 127 were ultimately retained in the study; 15 cases were excluded from the final analysis due to poor quality total RNA, cRNA amplification, or hybridization. Of the final 127 cases analyzed, 79 were considered traditional ALL by morphology and immunophenotyping and 48 were considered AML. 59/127 of these cases had rearrangements of the MLL gene.
The 254 member retrospective pre-B and T cell ALL case control study (the “preB” study) was selected from a number of pediatric POG clinical trials. A cohort design was developed that could compare and contrast gene expression profiles in distinct cytogenetic subgroups of ALL patients who either did or did not achieve a long term remission (for example comparing children with t(4;11) who failed vs. those who achieved long term remission). Such a design allowed us to compare and contrast the gene expression profiles associated with different outcomes within each genetic group and to compare profiles between different cytogenetic abnormalities. The design was constructed to look at a number of small independent case-control studies within B precursor ALL and T cell ALL. For the B cell ALL group, the representative recurrent translocations included t(4;11), t(9;22), t(1;19), monosomy 7, monosomy 21, Females, Males, African American, Hispanic, and AlinC15 arm A. Cases were selected from several completed POG trials, but the majority of cases came from the POG 9000 series, including 8602, 9406, 9005, and 9006 as long term follow up was available.
As standard cytogenetic analysis of the samples from patients registered to these older trials would not have usually detected the t(12;21), we performed RT-PCR studies on a large cohort of these cases to select ALL cases with t(12;21) who either failed (n=8) therapy or achieved long term remissions (n=22). Cases who “failed” had failed within 4 years while “controls” had achieved a complete continuous remission of 4 or more years. A case-control study of induction failures (cases) vs. complete remissions (CRs; controls) was also included in this cohort design as was a T cell cohort.
It is very important to recognize that the study was designed for efficiency, and maximum overlap, without adversely affecting the random sampling assumptions for the individual case-control studies. To design this cohort, the set of all patients (irrespective of study) who had inventory in the UNM POG/COG Tissue Repository and who had failed within 4 years of diagnosis (cases) were considered. Each such case was assigned a random number from zero to one. Cases were then sorted by this random number. The same process was applied to the totality of potential controls. For each case-control study, we then took the first N patients (requested in design) or all patients (whichever was smaller), meeting the entry requirements for the particular study. By maximizing the overlap in this fashion, a savings of over 20% compared to a design that required mutually exclusive entries was achieved. Yet for any given case-control study, the patients represent pure random samples of cases and controls. (For example if the first patient in the sort of the failure group were an African-American female with a t(1;19) translocation, she would participate in at least three case control studies). As for the infant leukemia cases, gene expression arrays were completed using 2.5 micrograms of RNA per case (all samples had >90% blasts) with double linear amplification. All amplified RNAs were hybridized to Affymetrix U95A.v2 chips.
The present invention makes use of a suite of high-end analytic tools for the analysis of gene expression data. Many of these represent novel implementations or significant extensions of advanced techniques from statistical and machine learning theory, or new data mining approaches for dealing with high-dimensional and sparse datasets. The approaches can be categorized into two major groups: knowledge discovery environments, and supervised classification methodologies.
Clustering, Visualization, and Text-Mining
1. VxInsight
VxInsight is a data mining tool (Davidson et al., J. Intellig. Inform. Sys. 11:259-285, 1998; Davidson et al., IEEE Information Visualization 2001, 23-30, 2001) originally developed to cluster and organize bibliographic databases, which has been extended and customized for the clustering and visualization of genomic data. It presents an intuitive way to cluster and view gene expression data collected from microarray experiments (Kim et al., Science 293:2087-92, 2001). It can be applied equally to the clustering of genes (e.g., in a time-series experiment) or to discover novel biologic clusters within a cohort of leukemia patient samples. Similar genes or patients are clustered together spatially and represented with a 3D terrain map, where the large mountains represent large clusters of similar genes/samples and smaller hills represent clusters with fewer genes/samples. The terrain metaphor is extremely intuitive, and allows the user to memorize the “landscape,” facilitating navigation through large datasets.
VxInsight's clustering engine, or ordination program, is based on a force-directed graph placement algorithm that utilizes all of the similarities between objects in the dataset. When applied to gene clustering, for example, the algorithm assigns genes into clusters such that the sum of two opposing forces is minimized. One of these forces is repulsive and pushes pairs of genes away from each other as a function of the density of genes in the local area. The other force pulls pairs of similar genes together based on their degree of similarity. The clustering algorithm terminates when these forces are in equilibrium. User-selected parameters determine the fineness of the clustering, and there is a tradeoff with respect to confidence in the reliability of the cluster versus further refinement into sub-clusters that may suggest biologically important hypotheses.
VxInsight was employed to identify clusters of infant leukemia patients with similar gene expression patterns, and to identify which genes strongly contributed to the separations. A suite of statistical analysis tools was developed for post-processing information gleaned from the VxInsight discovery process. Visual and clustering analyses generated gene lists, which when combined with public databases and research experience, suggest possible biological significance for those clusters. The array expression data were clustered by rows (similar genes clustered together), and by columns (patients with similar gene expression clustered together). In both cases Pearson's R was used to estimate the similarities. Analysis of variance (ANOVA) was used to determine which genes had the strongest differences between pairs of patient clusters. These gene lists were sorted into decreasing order based on the resulting F-scores, and were presented in an HTML format with links to the associated OMIM pages (Online Mendelian Inheritance in Man database, available on the world wide web through the National Center for Biotechnology Information), which were manually examined to hypothesize biological differences between the clusters. Gene list stability was investigated using statistical bootstraps (Efron, Ann. Statist. 7:1-26, 1979; Hjorth et al., Computer Intensive Statistical Methods, Validation Model Selection and Bootstrap. Chapman & Hall, London, 1994). For each pair of clusters 100 random bootstrap cases were constructed via resampling with replacement from the observed expressions (
2. Principal Component Analysis
Principal component analysis (PCA) is a well-known and convenient method for performing unsupervised clustering of high-dimensional data. Closely related to the Singular Value Decomposition (SVD), PCA is an unsupervised data analysis technique whereby the most variance is captured in the least number of coordinates. It can serve to reduce the dimensionality of the data while also providing significant noise reduction. It is a standard technique in data analysis and has been widely applied to microarray data. Recently (Raychaudhuri et al., Pac. Symp. Biocomput., 5:455-466, 2002) PCA was used to analyze cell cycles in yeast (Chu et al., Science, 282:699-705, 1998; Spellman et al., Mol. Biol. Cell, 9:3273-97, 1998); PCA has also been applied to clustering (Hastie et al., Genome Biology 1:research0003, 2000; Holter et al., Proc. Natl. Acad. Sci., 97:8409-14, 2000); other applications of PCA to microarray data have been suggested (Wall et al., Bioinformatics 17, 566-568, 2001).
PCA works by providing a statistically significant projection of a dataset onto an orthonormal basis. This basis is computed so that a variety of quantities are optimized. In particular we have (Kirby, Geometric Data Analysis. John Wiley & Sons, New York, 2001):
The Bayesian network modeling and learning paradigm (Pearl, Probabilistic Reasoning for Intelligent Systems. Morgan Kaufmann, San Francisco, 1988; Heckerman et al., Machine Learning 20:197-243, 1995) has been studied extensively in the statistical machine learning literature. A Bayesian net is a graph-based model for representing probabilistic relationships between random variables. The random variables, which may, for example, represent gene expression levels, are modeled as graph nodes; probabilistic relationships are captured by directed edges between the nodes and conditional probability distributions associated with the nodes. In the context of genomic analysis, this framework is particularly attractive because it allows hypotheses of actor interactions (e.g., gene-gene, gene-protein, gene-polymorphism) to be generated and evaluated in a mathematically sound manner against existing evidence. Network reconstruction, pathway identification, diagnosis, and outcome prediction are among the many challenges of current interest that Bayesian networks can address. Introduction of new-network nodes (random variables) can model effects of previously hidden state variables, conditioning prediction on such factors as subject characteristics, disease subtype, polymorphic information, and treatment variables.
A Bayesian net asserts that each node (representing a gene or an outcome) is statistically independent of all its non-descendants, once the values of its parents (immediate ancestors) in the graph are known. Even with the focus on restricted subnetworks, the learning problem is enormously difficult, due to the large number of genes, the fact that the expression values of the genes are continuous, and the fact that expression data generally is rather noisy. Our approach to Bayesian network learning employs an initial gene selection algorithm to produce 20-30 genes, with a binary binning of each selected gene's expression value. The set of selected genes then is searched exhaustively for parent sets of size 5 or less, with the induced candidate networks being evaluated by the BD scoring metric (Heckerman et al., Machine Learning 20:197-243, 1995). This metric, along with our variance factor, is used to blend the predictions made by the 500 best scoring networks. Each of these 500 Bayesian networks can be viewed as a competing hypothesis for explaining the current evidence (i.e., training data and prior knowledge) for the corresponding classification task, and the gene interactions each suggests are potentially of independent interest as well.
Bayesian analysis allows the combining of disparate evidence in a principled way. Abstractly, the analysis synthesizes known or believed prior domain information with bodies of possibly diverse observational and experimental data (e.g., microarrays giving gene expression levels, polymorphism information, clinical data) to produce probabilistic hypotheses of interaction and prediction. Prior elicitation and representation quantifies the strength of beliefs in domain information, allowing this knowledge and observational and experimental data to be handled in uniform manner. Strong priors are akin to plentiful and reliable data; weaker priors are akin to sparse, noisy data. Similarly, observational and experimental data can be qualified by its reliability, accuracy, and variability, taking into account the different sources that produced the data and inherent differences in the natures of the data. Of course, observational and experimental data will eventually dominate the analysis if it is of sufficient size and quality.
In the context of outcome and disease subtype prediction, we applied a highly customized and extended Bayesian net methodology to high-dimensional sparse data sets with feature interaction characteristics such as those found in the genomics application. These customizations included the parent-set model for Bayesian net classifiers, the blending of competing parent sets into a single classifier, the pre-filtering of genes for information content, Helman-Veroff normalization to pre-process the data, methods for discretizing continuous data, the inclusion of a variance term in the BD metric, and the setting of priors. Our normalization algorithm is designed to address inter-sample differences in gene expression levels obtained from the microarray experiments It proceeds by scaling each sample's expression levels by a factor derived from the aggregate expression level of that sample. In this way, afer scaling, all samples have the same aggregate expession level.
A set of training data, labeled with outcome or disease subtype, was used to generate and evaluate hypotheses against the training data. A cross validation methodology was employed to learn parameter settings appropriate for the domain. Surviving hypotheses were blended in the Bayesian framework, yielding conditional outcome distributions. Hypotheses so learned are validated against an out-of-sample test set in order to assess generalization accuracy. This approach was successfully used to identify OPAL1/G0 as strong predictors of outcome in pediatric ALL as described in Example II.
2. Support Vector Machines.
Support vector machines (SVMs) are powerful tools for data classification (Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, 2000; Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, 1999). The original development of the SVM was motivated, in the simple case of two linearly separable classes, by the desire to choose an optimal linear classifier out of an infinite number of potential linear classifiers that could separate the data. This optimal classifier corresponds not only to a hyperplane that separates the classes but also to a hyperplane that attempts to be as far away as possible from all data points. If one imagines inserting the widest possible corridor between data points (with data points belonging to one class on one side of the corridor and data points belonging to the other class on the other side), then the optimal hyperplane would correspond to the imaginary line/plane/hyperplane running through the middle of this corridor.
The SVM has a number of characteristics that make it particularly appealing within the context of gene selection and the classification of gene expression data, namely: SVMs represent a multivariate classification algorithm that takes into account each gene simultaneously in a weighted fashion during training, and they scale quadratically with the number of training samples, N, rather than the number of features/genes, d. In order to be computationally feasible, other classification methods first have to reduce the number of dimensions (features/genes), and then classify the data in the reduced space. A univariate feature selection process or filter ranks genes according to how well each gene individually classifies the data. The overall classification is then heavily dependent upon how successful the univariate feature selection process is in pruning genes that have little class-distinction information content. In contrast, the SVM provides an effective mechanism for both classification and feature selection via the Recursive Feature Elimination algorithm (Guyon et al., Machine Learning 46, 389-422, 2002). This is a great advantage in gene expression problems where d is much greater than N, because the number of features does not have to be reduced a priori.
Recursive Feature Elimination (RFE) is an SVM-based iterative procedure that generates a nested sequence of gene subsets whereby the subset obtained at iteration k+1 is contained in the subset obtained at iteration k. The genes that are kept per iteration correspond to genes that have the largest weight magnitudes—the rationale being that genes with large weight magnitudes carry more information with respect to class discrimination than those genes with small weight magnitudes. We have implemented a version of SVM-RFE and obtained excellent results—comparable to Bayesian nets—for a range of infant leukemia classification tasks with blinded test sets.
3. Discriminant Analysis
Discriminant analysis is a widely used statistical analysis tool that can be applied to classification problems where a training set of samples, depending a set of p feature variables, is available (Duda et al., Pattern Classification (Second Edition). Wiley, New York, 2001). Each sample is regarded as a point in p-dimensional space Rp, and for a g-way classification problem, the training process yields a discriminant rule that partitions Rp into g disjoint regions, R1 R2, . . . , Rg. New samples with unknown class labels can then be classified based on the region Ri to which the corresponding sample vector belongs. In many cases, determining the partitioning is equivalent to finding several linear or non-linear functions of the feature variables such that the value of the function differs significantly between different classes. This function is the so-called discriminant function. Discriminant rules fall into two categories: parametric and nonparametric. Parametric methods such as the maximum likelihood rule—including the special cases of linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) (Mardia et al., Multivariate Analysis. Academic Press, Inc., San Diego, 1979; Dudoit et al., J. Am. Stat. Ass'n. 97(457):77-87, 2002)—assume that there is an underlying probability distribution associated with each of the classes, and the training samples are used to estimate the distribution parameters. Non-parametric methods such as Fisher's linear discriminant and the k-nearest neighbor method (Duda et al., Pattern Classification (Second Edition). Wiley, New York, 2001) do not utilize parameter estimation of an underlying distribution in order to perform classifications based on a training set.
In applying discriminant analysis techniques to the gene expression classification problem, both categories of methods have been utilized, specifically LDA (binary classification) and Fisher's linear discriminant (multi-class problems). For the statistically designed infant leukemia dataset, LDA was applied successfully to the AML/ALL and t(4;11)/NOT class distinctions. Fisher's linear discriminant analysis was further used to identify three well-separated classes that clustered within the seven nominal MLL subclasses for which karyotype labels were available.
For both classes of methods, a major issue is the question of feature selection, either as an independent step prior to classification, or as part of the classifier training step. In addition to a simple ranking based on t-test score as used by other researchers (Dudoit et al., J. Am. Stat. Ass'n. 97(457):77-87, 2002), the use of stepwise discriminant analysis for determining optimal sets of distinguishing genes has been investigated. One challenge in the stepwise approach is the rapid increase of computational burden with the number of genes included in the initial set; the method is therefore being implemented on large-scale parallel computers. An alternative gene selection approach that is presently being explored is stepwise logistic regression (McCulloch et al., Generalized, Linear, and Mixed Models Wiley, New York, 2001; SAS Online Documentation for SAS System, Release 8.02, SAS Institute, Inc. 2001). Logistic regression is known to be well suited to binary classification problems involving mixed categorical and continuous data or to cases where the data are not normally distributed within the respective classes.
Various extensions of these techniques are expected to enable the incorporation of both categorical and continuous data in our classifiers. This enables the inclusion of known, discrete clinical labels (age, sex, genotype, white blood count, etc.) in conjunction with microrarray expression vectors, in order to perform more accurate classifications, particularly for outcome prediction. In addition to logistic regression as mentioned previously, one approach is to first quantify the categorical data (Hayashi, Ann. Inst. Statist. Math. 3:69-98, 1952), and then apply standard non-parameteric statistical classification techniques in the usual manner.
4. Fuzzy Inference
Traditional classification methods are based on the theory of crisp sets, where an element is either a member of a particular set or not. However many objects encountered in the real world do not fall into precisely defined membership criteria.
Fuzzy inference (also known as fuzzy logic) and adaptive neuro-fuzzy models are powerful learning methods for pattern recognition. Although researchers have previously investigated the use of fuzzy logic methods for reconstructing triplet relationships (activator/repressor/target) in gene regulatory networks (Woolf et al., Physiol. Genomics 3:9-15, 2000), these techniques have not been previously applied to the genomic classification problem. A significant advantage of fuzzy models is their ability to deal with problems where set membership is not binary (yes/no); rather, an element can reside in more than one set to varying degrees. For the classification problem, this results in a model that, like probabilistic methods such as Bayesian nets, can accommodate data sources that are incomplete, noisy, and may ultimately include non-numeric text-based expert knowledge derived from clinical data; polymorphisms or other forms of genomic data; or proteomic data that must be incorporated into the overall model in order to achieve a more accurate classification system in clinical contexts such as outcome prediction.
5. Genetic Algorithms
Fuzzy logic and other classification methods require the use of a gene selection method in order to reduce the size of the feature space to a numerically tractable size, and identify optimal sets of class-distinguishing genes for further analysis. We are exploring the use of genetic algorithms (GAs) for determining optimal feature sets during the training phase of a classification problem.
A GA is a simulation method that makes it possible to robustly search a very large space of possible solutions to an optimization problem, and find candidate solutions that are near optimal. Unlike traditional analytic approaches, GAs avoid “local minimum” traps, a classic problem arising in high-dimensional search spaces. Optimal feature selection for gene expression data where the sample size N is much smaller than the number of features d (for the Affymetrix leukemia data analyzed, d≈12,000 and N≈100-200) is a classic problem of this type. A genetic algorithm code has been developed by us to perform feature selection for the K-nearest neighbors classification method using the recently proposed GA/KNN approach (Li et al., Bioinformatics 17:1131-42, 2001); this method, which is compute-intensive, has been implemented on the parallel supercomputers. The approach has been applied recently to the statistically designed infant leukemia dataset, to evaluate biologic clusters discovered using unsupervised learning (VxInsight). The GA/KNN method was able to predict the hypothesized cluster labels (A,B,C) in one-vs.-all classification experiments.
Summary
To identify genes strongly predictive of outcome in pediatric ALL, we analyzed the retrospective case control study of 254 pediatric ALL samples described in Example IA. We divided the retrospective POG ALL case control cohort (n=254) into training (⅔ of cases, the “preB training set”) and test (⅓ of cases, the “preB test set”) sets, applied a Bayesian network approach, and performed statistical analyses. A particularly gene predictive of outcome in pediatric ALL was identified, corresponding to Affymetrix probe set 38652_at (“G0”: Hs. 10346; NM_Hypothetical Protein FLJ20154; partial sequences reported in GenBank Accession Number NM—017787; NM—017690; XM—053688; NP—060257). Two other genes, Affymetrix probe set 34610_at (“G1”: GNB2L1: G protein β2, related sequence 1; GenBank Accession Number NM—006098;); and Affymetrix probe set 35659_at (“G2”: IL-10 Receptor alpha; GenBank Accession Number U00672), were identified as associated with outcome in conjunction with OPAL1/G0, but were substantially less significant. OPAL1/G0, which we have named OPAL1 for outcome predictor in acute leukemia, was a heretofore unknown human expressed sequence tag (EST), and had not been fully cloned until now. G1 (G protein β2, related sequence 1) encodes a novel RACK (receptor of activated protein kinase C) protein and is involved in signal transduction (Wang et al., Mol Biol Rep. 2003 March; 30(1):53-60) and G2 is the well-known IL-10 receptor alpha.
Importantly, we found that OPAL1/G0 was highly predictive of outcome (p=0.0014) in a completely different set of ALL cases assessed by gene expression profiling by another laboratory (the St. Jude set of ALL cases previously published by Yeoh et al. (Cancer Cell 1; 133-143, 2002)). We also observed a trend between high OPAL1/G0 and improved outcome in our retrospective cohort of infant ALL cases.
We have fully cloned the human homologue of OPAL1/G0 and characterized its genomic structure. OPAL1/G0 is highly conserved among eukaryotes, maps to human chromosome 10q24, and appears to be a novel transmembrane signaling protein with a short membrane insertion sequence and a potential transmembrane domain. This protein may be a protein inserted into the extracellular membrane (and function like a signaling receptor) or within an intracellular domain. We have also developed specific automated quantitative real time RT-PCR assays to precisely monitor the expression of OPAL1/G0 and other genes that we have found to be associated with outcome in ALL.
Bayesian Networks
We used Bayesian networks, a supervised learning algorithm as described in Example IB, to identify one or more genes that could be used to predict outcome as well as therapeutic resistance and treatment failure. To identify genes strongly predictive of outcome in pediatric ALL, we divided the retrospective POG ALL case control cohort (n=254) described above into training (⅔ of cases) and test (⅓ of cases) sets. Computational scientists were blinded to all clinical and biologic co-variables during training, except those necessary for the computational tasks. A large number of computational experiments were performed, in order to properly sample the space of Bayesian nets satisfying the constraints of the problem. In the context of high-dimensional gene expression data, the inclusion of more nets than is typical in the literature appears to yield better results. Our initial results using Bayesian nets showed classification rates in excess of 90-95%.
Identification of Genes Associated with Outcome
A particularly strong set of genes predictive of outcome was identified by applying a Bayesian network analysis to the preB training set. The three genes in the strongest predictive tree identified by Bayesian networks are provided in Table 2.
Our analysis showed that pediatric ALL patients whose leukemic cells contain relatively high levels of expression of OPAL1/G0 have an extremely good outcome while low levels of expression of OPAL1/G0 is associated with treatment failure. At the top of the Bayesian network, OPAL1/G0 conferred the strongest predictive power; by assessing the level of OPAL1/G0 expression alone, ALL cases could be split into those with good outcomes (OPAL1/G0 high: 87% long term remissions) versus those with poor outcomes (OPAL1/G0 low: 32% long term remissions, 68% treatment failure). Detailed statistical analyses of the significance of OPAL1/G0 expression in the retrospective cohort revealed that low OPAL1/G0 expression was associated with induction failure (p=0.0036) while high OPAL1/G0 expression was associated with long term event free survival (p=0.02), particularly in males (p=0.0004). Higher levels of OPAL1/G0 expression were also associated with certain cytogenetic abnormalities (such as t(12;21)) and normal cytogenetics. Although the number of cases were limited in our initial retrospective cohort, low levels of OPAL1/G0 appeared to define those patients with low risk ALL who failed to achieve long term remission, suggesting that OPAL1/G0 may be useful in prospectively identifying children who would otherwise be classified as having low or standard risk disease, but who would benefit from further intensification.
The pre-B test set (containing the remaining 87 members of the pre-B cohort) was also analyzed. Unexpectedly, OPAL1/G0 when evaluated on the pre B test set showed a far less significant correlation with outcome. This is the only one of the four data sets (infant, pre-B training set, pre-B test set, and the Downing data set, below) in which no correlation was observed. One possible explanation is that, despite the fact that the preB data set was split into training and test sets by what should have been a random process, in retrospect, the composition of the test set differed very significantly from the training set. For example, the test set contains a disproportionately high fraction of studies involving high risk patients with poorer prognosis cytogenetic abnormalities which lack OPAL1/G0 expression; these children were also treated on highly different treatment regimens than the patients in the training set. Thus, there may not have been enough leukemia cases that expressed higher OPAL1/G0 levels (there were only sixteen patients with a high OPAL1/G0 expresion value in the test set) for us to reach statistcal significance. Finally, the p-value observed for the preB training set was so strong, as was the validation p-value for OPAL1/G0 outcome prediction in the independent data sets, that it would be virtually impossible that the observed correlation between OPAL1/G0 and outcome is an artifact.
In addition, PCR experiments recently completed in accordance with the methods outlined in Example III support the importance of OPAL1/G0 as a predictor of outcome. Although a large fraction (30%) of the 253 pre B cases could not be assessed by PCR due to sample availability, including 8 of the 36 cases from the pre B training set in which OPAL1/G0 was highly expressed, an initial analysis of the results on the 174 cases which could be assessed supports a clear statistical correlation between OPAL1/G0 and outcome (a p-value of about 0.005 on the PCR data alone, when the OPAL1/G0-high threshold is considered fixed). It should be noted that these PCR samples cut across the pre B training and test sets, and that the PCR results do not seem to reflect the same dichotomy in training and test set correlation as was seen in the microarray data. Furthermore, the RNA target for the PCR assays (directly amplified cDNA) and the Afffymetrix array experiments (linearly amplified twice cDNA) are quite different and it is satisfying that a moderately strong correlation (r=0.62) was observed between these two quite distinct methodologies to quantitate gene expression. Additionally, in a random re-sampling (bootstrap) procedure reported in herein, OPAL1/G0 does exhibit consistent significance.
As noted above, we evaluated expression levels of OPAL1/G0 in three entirely different and disjoint data sets. Two of the data sets, described above, were derived from retrospective cohorts of pediatric ALL patients registered to clinical trials previously coordinated by the Pediatric Oncology Group (POG): the statistically designed cohort of 127 infant leukemias (the “infant” data set); and the statistically designed case control study of 254 pediatric B-precursor and T cell ALL cases (the “pre-B” data set), specifically the 167 member “pre-B” training set. The third data set evaluated was a publicly available set of ALL cases previously published by Yeoh et al. (the “Downing” or “St. Jude” data set) (Cancer Cell 1; 133-143, 2002).
The following breakdown was conditioned on OPAL1/G0 expression level at its optimal threshold value, which in all data sets examined fell near the top quarter (22-25%) of the expression values. Low OPAL1/G0 expression was defined as having normalized OPAL1/G0 expression below this value, while high OPAL1/G0 expression was defined as having normalized OPAL1/G0 expression equal to or greater than this value.
Of the 167 members of the pre-B training set, 73 (44%) were classified as CCR (continuous complete remission) while 94 (56%) were classified as FAIL. Relative to the optimized threshold value, OPAL1/G0 expression was determined to be low in 131 samples and high in 36 samples. The following statistics were observed.
Low OPAL1/G0 Expression (131 Samples):
High OPAL1/G0 Expression (36 Samples):
The following p-values were observed for gene uncorrelated with outcome possessing any threshold point yielding our observations or better:
The significance of these p-values must be assessed in light of the fact that 12,000+ genes can be so considered (individually) against the training data. Even with 1.25×104 candidate genes, under the null hypothesis of no associations, the expected number of genes that possess a threshold yielding our observation (or better) is still extremely small:
Our analysis of the pre-B training set showed that pediatric ALL patients whose leukemic cells contain relatively high levels of expression of OPAL1/G0 have an extremely good outcome while low levels of expression of OPAL1/G0 is associated with treatment failure. In the entire pediatric ALL cohort under analysis, 44% of the patients were in long term remission for 4 or more years, while 56% of the patients had failed therapy within 4 years. At the top of the Bayesian network, OPAL1/G0 conferred the strongest predictive power; by assessing the level of OPAL1/G0 expression alone, ALL cases could be split into those with good outcomes (OPAL1/G0 high: 87% long term remission; 13% failures) versus those with poor outcomes (OPAL1/G0 low: 32% long term remissions, 68% treatment failure). Although the numbers are quite small as we continue down the Bayesian tree, outcome predictions can be somewhat refined by analyzing the expression levels of these G1 and G2.
We also investigated OPAL1/G0 expression level statistics across biological classifications typically utilized as predictive of outcome. The following represents a breakdown of OPAL1/G0 expression statistics within various subpopulations of the pre-B training set. The OPAL1/G0 threshold obtained by optimization in the original pre-B training set analysis (a value of 795) was used.
Normal Genotype (65 Members)
Outcome Statistics
Low OPAL1/G0 Expression (51 Samples)
High OPAL1/G0 Expression (14 Samples)
Outcome Statistics
Low OPAL1/G0 Expression (Bottom 78%; 10 Samples)
High OPAL1/G0 Expression (Top 22%; 14 Samples)
Outcome Statistics
Low OPAL1/G0 Expression (13 Samples)
High OPAL1/G0 Expression (4 Samples)
Outcome Statistics
Low OPAL1/G0 Expression (34 Samples)
High OPAL1/G0 Expression (1 Sample)
Outcome Statistics
Low OPAL1/G0 Expression (12 Samples)
High OPAL1/G0 Expression (0 Samples)
Outcome Statistics
Low OPAL1/G0 Expression (80 Samples)
High OPAL1/G0 Expression (29 Samples)
Outcome Statistics
Low OPAL1/G0 Expression (51 Samples)
High OPAL1/G0 Expression (7 Samples)
Outcome Statistics
Low OPAL1/G0 Expression (58 Samples)
High OPAL1/G0 Expression (21 Samples)
Outcome Statistics
Low OPAL1/G0 Expression (73 Samples)
High OPAL1/G0 Expression (15 Samples)
The data evidence a number of interesting interactions between OPAL1/G0 and various parameters used for risk classification (karyotype and NCI risk criteria). Age and WBC (White Blood Count), in particular, are routinely used in the current risk stratification standards (age>10 years or WBC>50,000 are high risk), yet OPAL1/G0 appears to be the dominant predictor within both of these groups. Indeed, OPAL1/G0 appears to “trump” outcome prediction based on these biological classifications. In other words, regardless of biological classification, roughly the same OPAL1/G0 statistics are observed. For example, even though MLL translocation t(12:21) is generally associated with very good outcome, when OPAL1/G0 is low, the t(12:21) outcome is not nearly as good as when OPAL1/G0 is high. This association is also present in the Downing data set (see below), according to our analysis, although it was not recognized by Yeoh et al.
In our retrospective cohort balanced for remission/failure, OPAL1/G0 was more frequently expressed at higher levels in ALL cases with normal karyotype (14/65, 22%), t(12;21) (14/24, 58%) and hyperdiploidy (4/17, 24%%) compared to cases with t(1;19) (2%) and t(9;22) (0%). 86% of ALL cases with t(12;21) and high OPAL1/G0 achieved long term remission; while t(12;21) with low OPAL1/G0 had only a 40% remission rate. Interestingly, 100% of hyperdiploid cases and 93% of normal karyotype cases with high OPAL1/G0 attained remission, in contrast to an overall remission rate of 40% in each of these genetic groups.
Although our cases numbers were small and the cases highly selected, there appeared to be a correlation between low OPAL1/G0 and failure to achieve remission in children with low risk disease, suggesting that OPAL1/G0 may be useful in prospectively identifying children with low or standard risk disease who would benefit from further intensification. Interestingly, in children in the standard NCI risk group (age<10; WBC<50,000) and an overall remission rate of 50% in this case control study, children with high OPAL1/G0 had an 86% long term remission rate. Even children with NCI high risk criteria (age>10, WBC>50,000) and an overall remission rate of 31% in this selected cohort, children with high OPAL1/G0 had an 87% remission rate. Finally, OPAL1/G0 was also highly predictive of outcome in T ALL (p=0.02), as well as B precursor ALL.
Our statistical analyses of the significance of OPAL1/G0 expression in the retrospective cohort revealed that low OPAL1/G0 expression was associated with induction failure (p=0.0036) while high OPAL1/G0 expression was associated with long term event free survival (p=0.02), particularly in males (p=0.0004). Interestingly, actual quantitative levels of OPAL1/G0 appeared to be important and there was a clear expression threshold between remission and relapse.
To further validate the role of OPAL1/G0 in outcome prediction in ALL, we tested the usefulness of OPAL1/G0 on two additional independent set of ALL cases, the statistically designed infant ALL cohort described above, and the publicly available St. Jude ALL dataset (Yeoh et al., Cancer Cell 1; 133-143, 2002). In these two data sets, it should be noted that we explored OPAL1/G0's statistics specifically, and (in this context) did not test any other gene. Hence, the significance of the p-values computed for these two additional data sets should not be balanced against a large number of potential candidate genes. There was only one gene considered, and that was OPAL1/G0. Further, the threshold was fixed using the top 22% (17 samples) expressors as the threshold, not optimized as it was in the analysis of the pre-B training set.
Of the 76 members of the infant ALL data set (restricted to no-marginal ALLs), 29 (38%) were classified as CCR (continuous complete remission) while 47 (62%) were classified as FAIL. The following statistics were observed.
Low OPAL1/G0 Expression (Bottom 78%; 59 Samples)
High OPAL1/G0 Expression (Top 22%; 17 Samples)
For the Downing data set, “Heme Relapse” and “Other Relapse” were classified as FAIL and the 2nd AML was discarded as being of indeterminate outcome. Of the 232 members of the Downing data set, 201 (87%) were classified as CCR (continuous complete remission) while 31 (13%) were classified as FAIL. The following statistics were observed.
Low OPAL1/G0 Expression (Bottom 78%; 181 Samples)
High OPAL1/G0 Expression (Top 22%; 51 Samples)
Low OPAL1/G0 Expression (Bottom 75%; 173 Samples)
High OPAL1/G0 Expression (Top 25%; 59 Samples)
It should be noted that all three of these data sets are totally disjoint, and as a result the latter two studies represent independent validation of the statistics observed in the original “pre-B” training set evaluation. As previously discussed, Yeoh et al. were not able to identify or validate genes associated with outcome in the St. Jude dataset. The St. Jude data set was not balanced for remission versus failure; the overall long term remission rate in this series of cases was 87%. Additionally, Yeoh et al. employed SVMs which included many genes in the classification that masked the significance of OPAL1/G0. Our adapted BD metric controlled model complexity and allowed the significance of OPAL1/G0 to be realized in this data set. Indeed, we found that 100% of the cases in this St. Jude series with higher levels of OPAL1/G0, regardless of karyotype, achieved long term remissions (p=0.0014).
The following represents a breakdown of OPAL1/G0 expression statistics within various subpopulations of the Downing data set. The OPAL1/G0 threshold (25%) obtained by optimization in the original pre-B training set analysis was used. This yields 59 high OPAL/G0 cases in total, which are distributed among the various subgroups as follows:
TEL-AML1 (61 Members)
Outcome Statistics
Low OPAL1/G0 Expression (7 Samples)
High OPAL1/G0 Expression (54 Samples)
Outcome Statistics
Low OPAL1/G0 Expression (46 Samples)
High OPAL1/G0 Expression
Outcome Statistics
Low OPAL1/G0 Expression (18 Samples)
High OPAL1/G0 Expression (1 Sample)
Outcome Statistics
Low OPAL1/G0 Expression (19 Samples)
High OPAL1/G0 Expression (2 Samples)
The human homologue of OPAL1/G0 was fully cloned and its genomic structure characterized. OPAL1/G0 is highly conserved among eukaryotes, maps to human chromosome 10q24, and appears to be a novel, potentially transmembrane signaling protein. To clone OPAL1/G0, RACE PCR was used to clone upstream sequences in the cDNA using lymphoid cell line RNAs. The genomic structure was derived from a comparison of OPAL1/G0 cDNAs to contiguous clones of germline DNA in GenBank. The total predicted mRNA length is approximately 4 kb (
Interestingly, preliminary studies reveal that the gene for OPAL1/G0 encodes two different RNAs (and potentially up to five different RNAs through alternative splicing of upstream exons) and presumably two different proteins based on alternative use of 5′ exons (1a and 1). These two different transcripts are differentially expressed in leukemia cell lines.
Interestingly, OPAL1/G0 appears to encode at least two different proteins through alternative splicing of different 5′ exons (1 and 1a).
Table 3 shows the results of RT-PCR assays performed in accordance with Example III that confirm alternative exon use in OPAL1/G0. While all leukemia cell lines (REH, SUPB15) contained an OPAL1/G0 transcript with exons 2-3 and with exon 1a fused to exon 2; only ½ of the cell lines and the primary human ALL samples isolated to date express the alternative transcript (exon 1 fused to exon 2).
100 ng equivalent RNA into each reaction
OPAL1/G0 appears to be rather ubiquitously expressed and it has a highly similar murine homologue. Preliminary examination of the translated coding sequence (
Characterization of G1 and G2
G1 encodes an interesting protein, a G protein β2 homologue that has been linked to activation of protein kinase C, to inhibition of invasion, and to chemosensitivity in solid tumors. It is also interesting that the Bayesian tree linked G2 (the IL-10 receptor a) to G6 and OPAL1/G0, as the interleukin IL-10 has been previously linked to improved outcome in pediatric ALL (Lauten et al., Leukemia 16:1437-1442, 2002; Wu et al., Blood Abstract, Blood Supplement 2002 (Abstract #3017).). IL-10 has been shown to be an autocrine factor for B cell proliferation and also to suppress T cell immune responses. ALL blasts that express a shortened, alternatively spliced form of IL-10 have been shown to have significantly better 5 year EFS (p=0.01) (Wu et al., Blood Abstract, Blood Supplement 2002 (Abstract #3017).). We have developed specific primers and probes to assess the direct expression of each of these genes in large ALL cohorts (Example III).
We have developed direct RT-PCR assays to precisely measure the quantitative expression of these genes in an efficient two step approach. First, we perform a “qualitative” screen for positive cases using non-quantitative “end-point” RT-PCR assays with rapid and very inexpensive detection using the Agilent bioanalyzer. Positive cases detected with this simple, rapid, and highly sensitive methodology are then targeted for precise quantitative assessment of a particular gene using automated quantitative real time RT-PCR (Taqman technology).
Sequences for OPAL1/G0 (both splice forms) and pseudogenes identified from the other chromosomes were aligned, and OPAL1/G0 primers were designed to maximize the differences between the true OPAL1/G0 genes and the pseudogenes. The primers and probe sequences developed for specific quantitative assessment of the two alternatively spliced forms of OPAL1/G0 (assessed by quantifying mRNAs with exon 1 fused to exon 2 or alternatively exon 1a fused to exons 2) are:
For Exon 1 or 1a to 2 (the (+) Primers are Sense and the (−) are Antisense):
For Exon 2 to 3:
Using these primers and probes, we have developed highly sensitive and specific automated quantitative assays for OPAL1/G0 expression over a wide expression range. A standard curve was derived for the automated quantitative RT-PCR assays for the two alternatively spliced forms of OPAL1/G0. The assays were performed in cell lines shown in Table 3 and are highly linear over a large dynamic range.
The primers and probe sequences developed for specific quantitative assessment of G1 (G protein β2) and G2 (IL10Rα) are:
G1: Spans 2 introns (1.9 kb and 0.3 kb); from Exon 3 to Exon 5; 278 bp Amplicon
G2: Spans 1 Intron of 3.6 kb; from Exon 3 to Exon 4; 189 bp Amplicon
Automated Quantitative RT-PCR
We routinely develop fluorogenic RT-PCR assays to detect the presence of leukemia-associated human genes, as well as viral genes, using an automated, closed analysis system (ABI 7700 Sequence Detector, PE-Applied Biosystems Inc., Foster City, Calif.). Accurate standards of cloned cDNAs containing the gene or sequence of interest are prepared in plasmid vectors (pCR 2.1, Invitrogen). These standard reagents are quantitated by fluorescence spectrometry and serially diluted over a six log range. Quantitative PCR is carried out in triplicate in the ABI 7700 instrument in a 96 well plate format, with optimized PCR conditions for each assay. The reverse transcriptase reaction employs 1 μg of RNA in a 20 μl volume consisting of 1× Perkin Elmer Buffer II, 7.5 mM MgCl2, 5 μM random hexamers, 1 mM dNTP, 40 U RNasin and 100 U MMLV reverse transcriptase. The reaction is performed at 25° C. for 10 minutes, 48° C. for 60 min and 95° C. for 10 min. 4.5 μl of the resulting cDNA is used as template for the PCR. This is added to 1× Taqman Universal PCR Master Mix (PE Applied Biosystems, Foster City, Calif.), 100 nM fluorescently labeled Taqman probe and 100 nM of each primer in a 50 μl volume. The PCR is performed in the PRISM 7700 Sequence Detector as follows: “hot start” for 10 minutes at 95° C. (with AmpliTaq Gold, Perkin-Elmer) then 40 two step cycles of 95° C. for 15 seconds and 60° C. for 1 minute. This system detects the level of fluorescence from cleaved probe during each cycle of PCR and constructs the data into an amplification plot. This displays the threshold cycle (CT) of detection for each reaction. The data collection and analysis are performed with Sequence Detection System v.1.6.3 software (PE Applied Biosystems, Foster City, Calif.). A standard concentration curve of CT versus initial cDNA quantity is generated and analyzed with the ABI software to confirm the sensitivity range and reproducibility of the assay. To confirm RNA integrity, a segment of the ubiquitously expressed E2A gene is also amplified in all patient samples, along with a standard E2A or GAPDH cloned cDNA dilution series. This method can be utilized to quantitatively analyze expression levels for any gene of interest.
First the preB training set was discretized using a supervised method as well as an unsupervised discretization. Next p-values were computed by using the formula (nr/nh−er)/(er*(1−er)) then determine the likelihood of this value in a t-distribution. Here nr=number of remissions for gene high, nh=number of cases with gene high, and er=expected value of remission (44%). The results were ranked according to this p-value, and the preB training set was compared to entire preB data set. The results are shown in Tables 4-7. Tables 4 and 6 show two different lists based on the training set; Tables 5 and 7 show the entire preB data set for each of the two different approaches, respectively. Note that OPAL1/G0 is included on each of these lists as correlated with outcome, and there is substantial overlap between and among the lists. These lists thus identify potential additional genes that may be associated with OPAL1/G0 metabolically, might help determine the mechanism through which OPAL1/G0 acts, and might identify additional therapeutic or diagnostic genes.
Cumulative Distribution Functions (CDFS)
First the Helman-Veroff normalization scheme was applied to the preB training set data. Then CDFs were computed, followed by average and maximum difference between the CDFs. The distance between the two CDF curves reflects how different the two distributions are, hence the maximum distance and the average distance are measures of the way the two set differed. Finally, the genes were ranked by average and maximum differences for pre B training set and the entire preB data set. The results are shown in Tables 8-11.
The relative expression level for Affymetrix probe 39418_at (i.e., 0.5=half the median) was plotted across our pediatric ALL cases organized by outcome: FAIL (left panel) or REM (right panel), using Genespring (Silicon Genetics). The results showed that this gene's relative expression appears to be higher across failure cases and lower across remission cases.
Affymetrix probe 39418_at appears to be a probe from the consensus sequence of the cluster AJ007398, which includes Homo sapiens mRNA for the PBK1 protein (Huch et al., Placenta 19:557-567 (1998)). The sequence's approved gene symbol is DKFZP564M182, and the chromosomal location is 16p13.13. Originally, PBK1 was discovered through the identification of differentially expressed genes in human trophoblast cells by differential-display RT-PCR Functional annotations for the gene that this probe seems to represent are incomplete, however the sequence appears to have a protein domain similar to the ribosomal protein L1 (the largest protein from the large ribosomal subunit). PBK1 may prove to be a useful therapeutic target for treatment of pediatric ALL.
We applied linear SVM, SVM with recursive feature elimination (SVM-RFE), and nonlinear SVM methods (polynomial and gaussian) to the pre B training dataset o get a list of genes associated with CCR/Fail. Table 12 shows the top 40 genes for evaluating remission from failure (CCR vs. FAIL). However, CCR vs. FAIL was nonseparable using these methods.
We also used SVM-RFE to discriminate between members of the data set who have the certain MLL translocations from those who do not. Table 13 shows the top 40 genes found to discriminate t(12;21) from not t(12;21) (we excluded patients without t(12;21) data from this analysis). Table 14 shows the top 40 genes found to discriminate t(1;19) from not t(1;19). We did not see significant separation for t(9;22), t(4;11) or hyperdiploid karyotypes.
We then performed analyses to discriminate CCR vs. FAIL conditioned on various karyotypes (t(12;21), t(1;19), t(9/22), t(4,11) and hyperdiploid (Tables 15-19). Although the results are marginal, the associated gene lists may be useful in risk classification and/or the development of therapeutic strategies.
To identify genes strongly predictive of outcome in pediatric ALL, we divided the retrospective POG ALL case control cohort (n=254) described above into training (⅔ of cases) and test (⅓ of cases) sets performed statistical analyses using VxInsight and ANOVA. Through this approach, we identified a limited set of novel genes that were predictive of outcome in pediatric ALL. Table 20 provides the list of the top 20 genes associated with remission vs. failure in the pre-B ALL cohort; several of these genes appear to reach statistical significance. These top 20 genes are ranked by ANOVA f statistics; we have also converted these f statistics to corresponding p values. Not surprisingly, overall p values for outcome prediction in VxInsight or with any other method are less than for prediction of genetic types or morphologic labels; we assume that this is due to the significant biologic heterogeneity of the outcome variable in our patient cohorts. A positive value in the “Contrast” column of Table 20 reveals that the gene identified is expressed at relatively higher levels in patients in long term remission; a negative value indicates that a particular gene is expressed at lower levels in patients in remission and at higher levels in patients who fail therapy.
Interestingly, OPAL1/G0 (38652_at; NM-Hypothetical protein FLJ20154); see Example II), at position 12 on the table, appeared on gene lists produced by four different supervised learning algorithms (Bayesian networks, SVM, Neurofuzzy logic) and was ranked extremely high (top 5 or 10 genes) or at the top (Bayesian) with each of these very distinct modeling approaches. The degree of overlap between outcome genes detected with these different modeling algorithms was quite striking.
The gene at the number 5 position on the table (Affy number 671_at, known as SPARC, secreted protein, acidic, cysteine-rich (osteonectin)) is interesting as a possible therapeutic target. Osteonectin is involved in development, remodeling, cell turnover and tissue repair. Because its principal functions in vitro seem to be involved in counteradhesion and antiproliferation (Yan et al., J. Histochem. Cytochemi. 47(12):1495-1505, 1999). These characteristics may be consistent with certain mechanisms of metastasis. Further, it appears to have a role in cell cycle regulation, which, again, may be important in cancer mechanisms. Furthermore, it should be noted that other significant (about p<0.10) genes on the list might also have mechanisms that, together, could be combined to suggest mechanisms consistent with the observed differences in CCR and FAILURE. The group of genes, or subsets of it, may have more explanatory power than any individual member alone.
In the context of disease karyotype subtype prediction, we applied Bayesian nets to the preB training set data in a supervised learning environment. A set of training data, labeled with disease karyotype subtype, is used to generate and evaluate hypotheses against the test data. The Bayesian net approach filters the space of all genes down to K (typically, K bewteen 20 and 50) genes selected by one of several evaluation criteria based on the genes' potential information content. For each classification task attempted, a cross validation methodology is employed to determine for what value of K, and for which of the candidate evaluation criteria, the best Bayesian net classification accuracy is observed in cross validation. Surviving hypotheses are blended in the Bayesian framework, yielding conditional outcome distributions. Hypotheses so learned are validated against an out-of-sample test set in order to assess generalization accuracy.
Approximately 30 genes from prediction of each karyotype were combined. The gene list in Table 21 can discriminate translocations of t(12;21), t(1;19), t(4;11), t(9;22) as well as hyperdiploid and hypodiploid karyotype from normal karyotype.
Classification Tasks and the Class Labels
We used supervised learning methods to discriminate between positive and negative outcomes (Remission (CCR) vs. Failure) and to discriminate among various karyotypes. The outcome statistics for the 167 member “training set” derived from the 254 member pre-B ALL cohort are shown in Table 22.
To discriminate among the various karyotypes, we considered three different classifications of the karyotypes (Table 23).
Data Preprocessing
The analysis was performed on the data set comprising the 167 training cases. We first eliminated the 54 of 67 control genes (those with accession ID starting with the AFFX prefix), and then eliminated those genes with all calls “Absent” for all 167 training cases. With these genes removed from the original 12625, we were left with 8582 genes. In addition, a natural log transformation was performed on 8582×167 matrix of the gene expression values prior to further analysis.
Ranking Genes
The 8582 genes are ranked by two methods based on ANOVA for each classification exercise. Method 1 ranks the genes in terms of the F-test statistic values. Method 2 assigns a rank to each gene in terms of the number of pairs of classes between which the gene's expression value differs significantly. Note that for binary classification problem (remission vs. failure), only Method 1 is applicable.
Discriminating Among the Classes
An optimal subset of prediction genes is further selected from top 200 genes of a given ranked gene list through the use of stepwise discriminant analysis. Then the classes are discriminated using the linear discriminant analysis. The classification error rate is estimated through the leave-one-out cross validation (LOOCV) procedure. A visualization of the class separation for each classification is produced with canonical discriminant analysis.
Discrimination Between Remission and Failure
The one way ANOVA (F-test, which is equivalent to two-sample t-test in this case) was performed for each of 8582 pre-selected genes and then the all these genes were ranked in terms of the p-value of F-test. The numbers of 0.05 and 0.01 significant discriminating genes are 493 and 108, respectively. The top 20 significant discriminating genes are tabulated in Table 24. An optimal subset of discriminating genes were selected from the top 200 genes using the stepwise discriminant analysis was also prepared. The number one significant prediction gene in both the ranked gene list and the optimal subset of prediction genes is 38652_at, hypothetical protein FLJ20154, corresponding to OPAL1/G0.
The optimal subset of discriminating genes was utilized with linear discriminant analysis to predict for Remission (CCR) vs. failure in the training set of 167 cases. The success rate of the predictor is estimated in three ways: Resubstitution, LOOCV with Fold Independent prediction genes, LOOCV with Fold dependent prediction genes, and the results are listed in Table 25.
Note:
stepwise = 1 means that the gene belongs to the optimal subset of prediction genes.
Discrimination Among various Karyotypes
The one way ANOVA (F-test) and the pair-wise comparison t-test were performed for each of 8582 pre-selected genes for the karyotype classification problem. Next, all genes were ranked based on the two methods described for outcome discrimination. The top 20 genes in each of ranked gene lists are listed in Tables 26 and 27. The tables also list the values of the statistic F and the number of pairs of classes between which the gene expression value differs at confidence level α=0.10, which is labeled as SIG#. An optimal subset of discriminating genes for each of the classes was selected from the top 200 genes with the stepwise discriminant analysis.
Each optimal subset of discriminating genes was utilized with linear discriminant analysis to predict for the corresponding classes in the training set of 167 cases. The success rate of the predictor is estimated in the same way as described in above for outcome prediction and the results are listed in Table 28.
Uniformly Significant Genes that Are Correlated with CCR vs. Failure
The three data sets derived from the retrospective statistically designed 254 member Pre-B data set were analyzed for their association with outcome: the 167 member training set, the 87 member test set and overall 254 member data set. Three measures were used: ROC accuracy A, F-test statistic and TNoM. Table 29 shows a list of genes correlated with outcome with the ranks determined by these different measures with the different data sets.
Two genes were consistently significant in both training and test sets and they are number one and number two significant genes in the overall data set. The two genes are 39418_at, DKFZP564M182 protein (PBK1) and 41819_at, FYN-binding protein (FYB-120/130). FYN is a tyrosine kinast found in fibroblasts and T lymphocytes (Popescu et al., Oncogene 1(4):449-451 (1987)).
Unexpectedly, although OPAL1/G0 was the most significant gene in the training data set, it was a much less significant gene in the test data set. Indeed, most of the significant genes in training set, like OPAL1/G0, became less significant in test set. The fact that most genes that did well in the training set did poorly in the test set lends support to our hypothesis that the test set's composition differed significantly from that of the training set. We therefore sought to increase the robustness of this statistical analysis.
Re-Sampling Training and Test Data Sets
Our goal was to identify genes that are significant irrespective of the data set. One way to get a stable (robust) list of genes that are highly correlated with the distinction of CCR vs. Failure is through the use of a random re-sampling (bootstrap) procedure. We randomly divided the overall data set into training and test sets 172 times. The numbers of CCRs and Failures in the training set was fixed to agree with the original training set, (i.e. 73 CCR s and 94 Failures). Each time the genes are ranked in the same way as in Table 1. That is, we produced 172 tables like Table 29 for the 172 different training and test sets.
We found that the gene ranking in the two data sets (training and test randomly resampled in each time) are typically quite different. However, in most runs, the two genes 39418_at (PBK1) and 41819_at (FYN-binding protein) were consistently significant in both the random training and test sets. We called these two genes the uniformly most significant genes. OPAL1/G0 (38652_at) also consistently shows significance.
Generation of a Robust Gene List (a List of Uniformly Significant Genes)
The following rule was used to assign a quantitative value to each gene to evaluate the extent that the gene is uniformly significant: in each training and test set, the genes are ranked by three measures. After 172 resamplings, each gene has 172 ranks on the three measures in each of two data sets. We calculate the average or mean of the 172 ranks of each gene. We then sorted the genes on the mean ranks. In this way we get a robust gene list corresponding to each of three measures in each of the two data sets.
The top 100 genes in the robust gene list are presented in Table 30 with the robust ranks determined by the three different measures. We found that the ranks in training set and test set closely agree with each other and with the rank determined by the overall data set. The two most uniformly significant genes (39418_at and 41819_at) were ranked first and second. OPAL1/G0 survives in this analysis and had good average ranks on the three measures, but was only about 10th best overall.
*** = AFFX-HUMGAPDH/M33197_M_at
Homo sapiens
Homo sapiens
Homo sapiens
Homo sapiens
Homo sapiens
Homo sapiens
Homo sapiens
Homo sapiens,
Homo sapiens
Homo sapiens
Threshold independent supervised learning algorithms (ROC) and Common Odds Ratio) were used to identify genes associated with outcome in the 167 member pediatric ALL training set described in Example II. Data were normalized using Helman-Veroff algorithm. Nonhuman genes and genes with all call being absent were removed from the data.
The following lists of genes associated with outcome (CCR vs. FAIL) were identified.
*indicates low expression value predicts CCR
Homo sapiens cDNA FLJ30991 fis, clone HLUNG1000041
Homo sapiens CD44 isoform RC (CD44) mRNA, complete cds
*indicates low expression value predicts CCR
*indicates low expression value predicts CCR
Homo sapiens cDNA FLJ30991 fis, clone HLUNG1000041
Homo sapiens CD44 isoform RC (CD44) mRNA, complete cds
Rank 1 and A1 are calculated based on the data with T-cell patients removed.
Rank 2 and A2 are calculated based on all 167 training data.
*indicates low expression value predicts CCR
factor, arginine/serine-rich6
growth factor receptor
2-inhibitory kinase
Homo sapiens, Similar to RKENcDNA 2600001B17 gene, done IMAGE2822298, mRNA, partial cds
Rank 1 and A1 are calculated based on the T-cell data only.
Rank 2 and A2 are calculated based on all 167 training data.
The following tables represent consolidations of a number of different gene lists representing rankings in B-Cell and T-Cell data sets.
Homo sapiens cDNA FLJ31861 fis, clone
Homo sapiens clone IMAGE 25997
Homo sapiens, Similar to RIKEN cDNA
Homo sapiens CD44 isoform RC (CD44) mRNA,
Homo sapiens mRNA; cDNA DKFZp586C091
Homo sapiens cDNA FLJ30991 fis, clone
Homo sapiens clone FBA1 Cri-du-chat region
Homo sapiens, clone MGC: 9628 IMAGE:
Homo sapiens cDNA FLJ30991 fis, clone
Homo sapiens CD44 isoform RC (CD44) mRNA,
Homo sapiens cDNA FLJ32819 fis, clone
Introduction. This Example summarizes and correlates selected gene lists predictive of outcome (specifically, CCR vs. Failure) obtained for the pre-B ALL cohort described in Example IB. “Task 2” refers to CCR vs. FAIL for B-cell+T-cell patients; “Task 2a” is CCR vs. FAIL for B-cell only patients. Gene lists selected for evaluation were produced by the following methods: (1) a compilation of genes identified using feature selection combined with a supervised learning techniques such as SVM/RFE, Discriminant Analysis/t-test, Fuzzy Inference/rank-ordering statistics, and Bayesian Nets/TNoM; note that SVM/RFE and Bayesian Net/TNoM are both multivariate (MV) gene selection techniques; the others are univariate; (2) TNoM gene selection; (3) supervised classification; (4) empirical CDF/MaxDiff method; (5) threshold independent approach; (6) GA/KNN; (7) uniformly significant genes via resampling; (8) ANOVA “gene contrast” lists derived via VxInsight.
The techniques fall into two broad categories, which we have termed univariate and multivariate.
Group I (univariate). These methods evaluate the significance of a given gene in contributing to outcome discrimination on an individual basis. They include:
The top genes in each group are identified and to determine how often the same genes turn up repeatedly within each group. The following two tables correspond to Tasks 2 (Table 40) and 2a (Table 41). The top 20 genes found in Table 40 are listed in Table 42 with more detailed annotations.
Homo sapiens cDNA FLJ30991
Homo sapiens CD44 isoform
Homo sapiens mRNA; cDNA
Homo sapiens mRNA; cDNA
Homo sapiens
Summary
Current ALL classification schemes mask inherent biologic predictors of outcome. Classification schemes that reflect the underlying biology of this disease could guide patients to more tailored treatments. To develop gene expression-based classification schemes related to the pathogenic basis of pediatric lymphoblastic leukemia, gene expression patterns observed in the statistically designed cohort containing 254 pediatric acute lymphoid leukemia (ALL) cases described in Example IA were examined using Affymetrix U95AV2 oligonucleotide microarrays. Additionally, in order to model remission vs. failure conditioned to predictive cytogenetics, matched patients were selected among all major genetic prognostic groups (MLL/AF4, BCR/ABL, E2A/PBX1, TEL/AML1, hyperdiploidy, and hypodiploidy).
The data were analyzed for class discovery using unsupervised clustering methods (hierarchical clustering and a force directed algorithm) and for class prediction using supervised learning techniques including Bayesian Nets, Fisher's Discriminant, and Support Vector Machines. During initial exploratory data analysis, several distinct clusters were observed using unsupervised clustering methods. Interestingly, no correlation between the currently employed risk classification groups and these clusters was evident. In particular, ALL cases characterized by accepted “good” and “poor” risk genetics were distributed differentially among the identified clusters. This class discovery analysis indicates a more complex intrinsic genetic and biologic background in pediatric ALL than currently appreciated.
Gene expression profiles associated with achievement of remission vs. treatment failure were then sought using supervised learning techniques. Derived predictive algorithms were applied to a training set of the data. Their performance was evaluated with multiple cross validation and bootstrap runs, with an average accuracy of 72% and low variance. These models are being tested on the validation set. The results provide evidence of additional heterogeneity of pediatric ALL, which may relate to novel transformation pathways and clinical outcomes.
Data Analysis
The analysis of the gene expression data was done in a two-step approach. First, in order to identify potential clusters and inherent biologic groups, a large number of clinical co-variables were correlated with the expression data using unsupervised clustering methods such as hierarchical clustering, principal component analysis and a force-directed clustering algorithm coupled with a novel visualization tool (VxInsight). For class prediction, supervised learning methods such as Bayesian Networks, Support Vector Machines with Recursive Feature Elimination (SVM-RFE), Neuro-Fuzzy Logic and Discriminant Analysis were employed to create classification algorithms. The performance of these classification algorithms was evaluated using fold-dependent leave-one-out cross validation (LOOCV) techniques. These methods combined allowed the identification of genes associated with remission or treatment failure and with the different translocations across the dataset.
Results
To explore potential clusters driven by gene expression profiles, the initial analysis of the pediatric ALL cohort was accomplished using a force directed clustering algorithm coupled with a novel visualization tool, VxInsight as described in Example IB. Unexpectedly, we discovered 9 novel biologic clusters of ALL (2 distinct T-cell ALL clusters (S1 and S2) and 7 (2 related clusters are seen in cluster X) distinct B-lineage ALL clusters (A, B, C, X, Y, Z)) each with distinguishing gene expression profiles. Using ANOVA, we identified over 100 statistically significant genes uniquely distinguishing each of these cohorts; a list of the top statistically significant genes distinguishing each cluster is provided in Table 43. Review of these lists of genes reveals many interesting signaling molecules and transcription factors. The X cluster (which contains two highly related clusters) is quite unique in having expression of several genes regulating methylation and folate metabolism.
Examination of the cluster data reveals that while there are some trends, no cytogenetic abnormality precisely defines or is correlated with any specific cluster. It is interesting that cases with a t(12;21) or hyperdiploidy, both conferring low risk and good outcomes, tend to cluster together; although combinations of these cases can be seen primarily in clusters C and Z as well as the top component of the X cluster indicating that there is still heterogeneity in gene expression profiles associated with these clusters. On the terrain map from VxInsight (
The stability and structure of the clusters was explored using methods of data perturbation. Because the clusters appeared to be steady, subsequent exploration of the group-characterizing genes was performed using analysis of variance (ANOVA). This method was applied to order all of the genes with respect to differential expressions between the groups. The strongest 0.1% of the genes were tabulated in lists. The strength of these gene lists was studied using statistical bootstrapping as described in Example IB, and suggested that the identified groups represented well-separated patient subclasses.
Surprisingly, with the exception of the T-ALL cases (clusters S1 and S2), the clustering of ALL patients was independent of karyotype, suggesting that common tumor genetics, as currently applied to prognostic schema, do not strongly influence or drive innate expression profiling in pediatric ALL. However, fewer “adverse prognosis” genetics were distributed among certain clusters (e.g. C and Z). Remarkably, patients with translocations such as t(9;22)/BCR-ABL, t(1;19)/E2A/PBX1, and t(12;21)/TEL/AML1, were distributed among several clusters, suggesting biologic heterogeneity beyond the present tendency to group these various entities for the purpose of prognosis and outcome prediction. The results of these class discovery methods suggested that, when applied to our patient data set, unsupervised techniques elucidate underlying novel subgroups pediatric ALL. In turn, this reassessment of tumor heterogeneity encourages the design of additional studies to ascertain whether these data can enhance the discriminatory power of currently employed prognostic variables.
Analysis was therefore next focused on class prediction. The process of defining the best set of discriminating genes between known subsets of samples can be accomplished using supervised learning techniques such as Bayesian Networks, linear discriminant analysis and support vector machines (SVM). In contrast to unsupervised methods that generate inherent “classes” for each gene or patient, supervised learning methods are trained to recognize “known classes”, creating classification algorithms that may also uncover interesting and novel therapeutic targets.
Genes that best discriminated T-lineage ALL from B-lineage ALL were identified using principal component analysis and ANOVA of the cluster-differentiating genes generated from the VxInsight analysis. Significant overlap was observed between the 2 methods used in our analysis of the T-cell ALL gene expression profile, as well as with published data (Yeoh et al., Cancer Cell 1; 133-143, 2002), both in the actual presence of the same genes, as well as in relative rank (
Gene expression profiles characteristic of translocation types were derived using supervised learning techniques. 147 genes derived from Bayesian network analysis that allowed the identification of samples within each of the major translocation groups with accuracy rates higher than 90%, as calculated by fold dependent leave-one-out cross validation. This filtered data analysis of gene expression conditioned on karyotype generated distinct case clustering, confirming that unique gene expression “signatures” identify defined genetic subsets of ALL. This corroborates recently published data (Yeoh et al., Cancer Cell 1; 133-143, 2002) which revealed that karyotypic sub-groups of ALL are characterized by specific gene expression profiles (
Comparison with Method and Results of Yeoh et al. (Cancer Cell 1; 133-143, 2002)
Yeoh et al., in a study performed on the “Downing” or “St. Jude” data set as described above, reported that pediatric ALL cases clustered according to the recurrent cytogenetic abnormalities associated with ALL, and thus, that cytogenetics could define these intrinsic groups. However, careful reading of this report and the methods of analysis employed reveals that these investigators did not perform and/or report the results of true unsupervised learning methods and class discovery. Rather, these investigators first used supervised learning algorithms (primarily Support Vector Machines) to identify short lists of expressed genes that were associated with each recurrent cytogenetic abnormality in ALL. Using a highly selected set of only 271 genes that resulted from this supervised learning approach, they then performed hierarchical clustering or PCA using the expression data derived from only this set of selected genes. As would be expected from this approach, distinct ALL clusters could be defined based on shared gene expression profiles and each cluster was associated with a specific cytogenetic abnormality. However, this approach did not reveal what the underlying structure was in the gene expression profiles if one took a truly unbiased approach and performed real class discovery.
Furthermore, although Yeoh et al. attempted to use supervised learning methods to identify genes associated with outcome, they were not successful. Potential outcome genes identified in training sets could not be confirmed in independent test sets, indicating that the learning algorithms employed were “over-fitting” the data—a not uncommon problem with supervised learning algorithms. Another potential problem with these studies was that was no statistical design for the cases selected for study in this St. Jude cohort; cases were selected simply based on sample availability. Thus, in contrast to our retrospective POG cohort design in which cases with long term remission were balanced roughly 50:50 with cases that failed, the St. Jude cases were predominantly cases with long term remission (>80%), making the modeling in the St. Jude dataset far more difficult. We have come to appreciate is how important statistical design and case selection is to any array study (indeed for any scientific study) and that for supervised learning algorithms and class prediction, it is very important to have the label that one is trying to predict (such as outcome or the presence of a particular genetic abnormality) balanced 50:50 in the cohort undergoing modeling and within the training and test sets.
To determine if traditional biologic and clinical subgroups of infant leukemia cases could be identified by gene expression profiles, 126 infant leukemia cases registered to NCI-sponsored Infant Oncology Group/Children's Oncology Group treatment trials were studied using oligonucleotide microarrays containing 12,625 probe sets (Affymetrix U95Av2 array platform). Of the 126 cases, 78 were ALL (62%), 48 were AML (38%) and 53 (42%) cases had translocations involving the MLL gene (chromosome segment 11q23).
The exploratory evaluation of our data set was performed in several steps. The first step of the analysis was the construction of predictive classification algorithms that linked the gene expression data to the traditional clinical variables that define treatment, using supervised learning techniques, and further, the exploration of patterns that could predict patient outcomes. As described in Example IA, the 126 patients were divided into statistically balanced and representative training (82 patients) and test sets (44 patients), according to the clinical labels (leukemia lineage, cytogenetics and outcome). For classification purposes, two primary supervised approaches were used; Bayesian networks and recursive feature elimination in the context of Support Vector Machines (SVM-RFE). Additional classification techniques (Fuzzy inference and Discriminant Analysis) were used for comparison purposes.
All of the classification algorithms were established based on the training data set and then used to predict the class of the samples in the test. Two statistical significance tests were employed to further evaluate the prediction accuracy of those algorithms. The first tested whether the success rate of each classification algorithm was significantly greater than the value that would be expected by chance alone (i.e. whether the success rate was significantly greater than 0.5, where the success rate=# of correct predictions/total predictions). The second prediction accuracy test used the true positive proportion (TP) and false positive proportion (FP) value computed for one of the two classes. For a binary classification problem, TP is the ratio of correctly classified samples in the class to the total number in the class. FP is the proportion of misclassified samples in the other class to the total number in that class. To test whether the true positive proportion was significantly greater than the false positive proportion, we used Fisher's exact test. The p-values of the two tests along with the success rates for each of the classification algorithms with respect to the classification tasks of interest are listed in Table 44. As shown in the table, both evaluation methods confirmed that the classification results for the lineage labels (ALL/AML) and the presence or absence of t(4;11) rearrangements were significant at level α=0.05. In other words, all the supervised learning techniques employed were successful in finding a distinction between ALL and AML samples, and the presence/absence of t(4;11) rearrangements. Detailed gene lists that characterize each one of these leukemia subtypes were obtained from all the classifiers used and can be found in the Supplemental Information.
Class Discovery: Expression Profiles Partition Infant Leukemia Cases in Three Groups
To explore the intrinsic structure of the data independent of known class labels, several unsupervised clustering methods were employed. These unsupervised approaches allowed patient separation into potential clusters based on overall similarity in gene expression, without prior knowledge of clinical labels. As discussed below, although certain degree of correlation of our unsupervised clusters with traditional lineage (ALL/AML) and cytogenetics (MLL or not) could be observed, those labels were not enough to completely explain the results of our unsupervised clustering methods, suggesting that leukemia lineage and cytogenetics are not the only important factors in driving the inherent biology of these gene expression groups.
Initially, the data were investigated using agglomerative hierarchical clustering (Eisen et al., 1998). Hierarchical clustering results from the 126 infant leukemia samples using all genes yielded several groups that seemed to have no relation to the known lineage labels or the partition of the data suggested by the presence or absence of MLL rearrangements (see supplemental information).
The next technique used was Principal Component Analysis (PCA). PCA, closely related to the Singular Value Decomposition (SVD), is an unsupervised data analysis method whereby the most variance is captured in the least number of coordinates (Joliffe, 1986; Kirby, 2001; Trefethan & Bau, 1997). As shown in
The force-directed clustering algorithm (Davidson et al., 1998; 2001) places patients into clusters on the two-dimensional plane by minimizing two opposing forces. Briefly, the algorithm forms groups of patients by iteratively moving them toward one another with small steps proportional to the similarity of their gene expression, as measured by Pearson's correlation coefficient. To avoid collecting all of the patients into a single group, a counteracting force pushes nearby patients away from each other. This force increases in proportion to the number of nearby patients and has a strong local effect, thus acting to disperse any concentrated group of patients. This force affects only patients who are near each other, while the attractive force (Pearson's similarity) is independent of distance. The algorithm moves patients into a configuration that balances these two forces, thus grouping patients with similar gene expression. The spatial distribution of patients is then visualized on a three-dimensional plot, similar to a terrain map, where the height of the peaks denotes the local density of patients. This method has been useful in inferring functions of uncharacterized genes clustered near other genes with known functions (Kim, 2001) and for the analysis and mapping of various databases (Davidson, 1998, Werner-Washburne, 2002)
When applied to the infant data, the VxInsight clustering algorithm identifies several pattern of gene expression across the patients, suggesting the existence of three major groups (
Several further explorations into the VxInsight clusters were pursued. Linear discriminant analysis was used to separate the three clusters. The object of discriminant analysis is to weight and linearly combine information from the feature variables in a manner that clearly distinguishes labeled subclasses of the data. More specifically, the idea is to find a linear function of the feature variables such that the value of this function differs significantly between different classes. This function is the so-called discriminant function. Then, ANOVA was performed to rank cluster-discriminating genes in term of their F-test statistic values. From the top genes, a subset of genes was selected using stepwise discriminant analysis. This subset of genes served as the discriminating variables needed by linear discriminant analysis. The error rate of the derived classification results was 0.03, as estimated using fold-independent leave one out cross-validation (LOOCV). This indicated that the three VxInsight clusters were well separated.
There was also support for the existence of the VxInsight groupings even when only a subset of the data was used. For example, three widely separated groups of patients were observed when using only the patients in the training set. The addition of the rest of the patients in the test set, however, did induce change. In particular, the cores of Groups A and Groups C remained separated while Group B increased to include marginal members of groups A and C. The observation of similar grouping in both the entire set and the training set alone increased our interest in discerning the force driving the clustering for the patients in the VxInsight groups.
Finally, we confirmed our ability to classify patients into the VxInsight groups A, B, and C. Such a demonstration showed that we could categorize new patients into our grouping in the future (e.g. for treatment or diagnosis). To accomplish this, a multi-class Support Vector Machine (SVM) was trained using the actual labels A, B, and C in the patients from the training set. The prediction accuracy of this SVM on the test set was 95%. To verify that this result was improbable by chance alone, a randomization test was also performed. The labels A, B and C were randomly reassigned to the patients in both the training and the test set. Then, another SVM was trained with the re-labeled data in the training set. This SVM achieved a prediction accuracy of only 40% on the test set.
Subsequent exploration of the cluster-characterizing genes was performed using analysis of variance (ANOVA). The F-scores from this method were used to order all of the genes with respect to differential expressions between the groups. The strongest ranking 100 genes were then tabulated. The stability and strength of these gene lists was studied using statistical bootstrapping (Efron, 1979; Hjorth, 1994). This analysis provided a powerful method for determining the likelihood that a gene (high on the gene list determined from the actual data) would remain near the top of any gene list generated from experimental data similar to that which we actually observed. While this method allowed the identification of genes that had a unique pattern in each cluster and defined inter-clusters differences, it is important to make a distinction between these genes and the ones active in each one of the clusters (See supplemental information). Some very surprising findings were uncovered after completing a detailed analysis of the genes responsible for the distinction between clusters. These results, together with the stability of the clusters, suggest that the identified groups represent well-separated patient subclasses.
Approaches to Inherent Biology
Expression profiles identified different clusters of infant leukemia cases, not related to type labels or cytogenetics, but characterized by different genes predominantly expressed in, and probably related to, three independent disease initiation mechanisms. The sets of cluster-discriminating genes can be used to identify each biologic group and hence represent potentially important diagnostic and therapeutic targets (See Table 45). A heat map/dendrogram was produced with the top 30 genes that characterized each one of the three clusters, generated from the ANOVA analysis. Analysis of these genes revealed patterns that imply different features with potential clinical relevance.
The top cluster of cases (
Other genes in this group with an absolutely unique pattern of expression include growth inhibitory factors like methallothionein 3 (MT3), embryonic cell transcription factors (UTF1) and stem cell antigens (prostate stem cell antigen) with remarkable homology to cell surface proteins that characterize the earliest phases of hematopoietic development (Reiter, 1998).
The left cluster of cases (
Finally, the third rightmost cluster (
Altogether, the results of our class discovery methods suggested that, when applied to our patient data set, unsupervised techniques elucidate underlying novel subgroups of infant leukemia cases. In turn, this reassessment of tumor heterogeneity encourages the design of additional studies to ascertain whether these data can enhance the discriminatory power of currently employed prognostic variables.
Heterogeneous Distribution of the MLL Cases
The most common mutations in infant leukemia are translocations of the MLL gene at chromosome band 11q23. Interestingly, the MLL cases in cluster A (
MLL cases with the same translocation (t(4;11) in clusters A and B) had dramatic differences in their gene expression profiles. The mechanisms that might underlie this striking difference are currently under study. Genes that have common patterns in the MLL cases across all three clusters have been identified; as well as genes that are uniquely expressed and which distinguish each MLL translocation variant. Although MLL cases are not homogeneous, it is interesting that the list of statistically significant genes derived in this study is quite similar to the list of genes derived by previous groups working in infant MLL leukemia (Armstrong, 2002). For reasons not understood, infants are more prone to MLL rearrangements that inhibit apoptosis and cause transformation. (reviewed in Van Limbergen et al, 2002). Our results suggest that the MLL translocation in these patients may not be the “initiating” event in leukemogenesis. It is possible that after a distinct initiating event, the infant patient is more prone to rearrange the MLL gene, and that this rearrangement leads to further cell transformation by preventing apoptosis. Alternatively, an MLL translocation could be a permissive initiating event with leukemogenesis and final gene expression profile determined more strongly by second mutations. Further studies within the MLL group of infant leukemia patients may provide the clues to processes determinant in leukemic transformation.
Pathways to Failure in Infant Leukemia
In general, gene expression data has supported the existence of several categories of acute leukemias related to the traditionally defined leukemia types, ALL and AML (Golub, 1999; Moos, 2002). However, while expression profiling is a robust approach for the accurate identification of known lineage and molecular subtypes across acute leukemia cases, the search for clinically relevant prognosis discriminators based on gene expression patterns has been less successful (Armstrong, 2002; Ferrando, 2002; Yeoh, 2002). As shown in Table 46, only SVM-RFE was able to identify remission vs. failure across the unconditioned data set with a total error rate differing from random prediction (success rate of 64% at a significance level of p<0.1). Interestingly, the performance of our outcome classification algorithms was not increased when conditioned on either of the traditional criterion of lineage (ALL vs. AML) nor cytogenetics (MLL vs. not MLL), providing further support for questioning the predictive value of these traditional clinical labels in explaining outcome in infant patients. However, far greater success in outcome prediction is obtained when conditioning the classifying algorithms on the VxInsight cluster membership. The effect of the three VxInsight clusters on our ability to predict remission vs. failure was then explored. In particular, we attempted to predict remission vs. failure in the entire data set, conditioned on the knowledge of into which VxInsight cluster each case falls. The hope was that, by utilizing knowledge of VxInsight cluster membership, inter-cluster expression profile variability of cases—which is not necessarily relevant to outcome prediction—would be eliminated, allowing intra-cluster variability relevant to outcome prediction to be more easily discovered by our classification algorithms.
Table 46 demonstrates that prediction accuracy is gained by coupling the supervised learning algorithms with VxInsight clustering. In the Bayesian method, accuracy against the test set rises from 0.568 (p=0.256) to 0.703 (p=0.010). Smaller improvements after conditioning are found with the other methods as well. One can look also at the prediction accuracy within the VxInsight clusters individually. There again a general rise in accuracy is observed, though not to a level of statistical significance, possibly due to the small size and/or class balance of the individual clusters.
We note that, from the more abstract perspective of machine learning theory, the construction of the VxInsight clusters is viewed as an external feature creation algorithm that is applied to a data set before the supervised learning algorithms begin their training. In the application at hand, the created feature is 3-valued, indicating membership of a case in VxInsight cluster A, B, or C. This feature creation process is akin to the pre-selection of features, based on measures of information content, that is employed by many supervised learning algorithms when run on problems of high dimensionality. One difference between the VxInsight feature creation step and traditional feature selection is that VxInsight clustering is performed without knowledge of the class label to be predicted (outcome, in this context), and hence it is reasonable to perform the clustering on the entire data set (train and test sets combined) at once.
The relative strength of the gene lists and parent sets can be thought of as being correlated with the prediction accuracy within the corresponding VxInsight cluster. However, it is the application of the lists and parent sets together within the two-step VxInsight/supervised learning conditioning framework described above that achieves statistical significance in its accuracy.
It is rather unlikely that random chance alone would improve such accuracy levels, since a process independent of the best error rate generated the VxInsight clustering. These results are taken as strong evidence that the VxInsight patient clusters reflect biologically important groups and, are clinically exploitable. In contrast, comparable accuracy was not achieved by conditioning on either of the traditional criteria of ALL vs. AML, nor MLL vs. not MLL. This may indicate that, as determined by our molecular analysis, these traditional clinical criteria for segregating treatment cohorts are less defining than has been supposed.
Table 47 illustrates the resulting set of distinguishing genes associated with remission/failure in the overall data set (not partitioning by type, cytogenetics or cluster), which represent potentially important diagnostic and therapeutic targets. Some of these outcome-correlated genes include Smurf1, a new member of the family of E3 ubiquitin ligases. Smurf1 selectively interacts with receptor-regulated MADs (mothers against decapentaplegia-related proteins) specific for the BMP pathway in order to trigger their ubiquitination and degradation, and hence their inactivation. Targeted ubiquitination of SMADs may serve to control both embryonic development and a wide variety of cellular responses to TGF-β signals. (Zhu, 1999). Another interesting gene is the SMA- and MAD-related protein, SMAD5, which plays a critical role in the signaling pathway in the TGF-β inhibition of proliferation of human hematopoietic progenitor cells (Bruno, 1998). The list also included regulators of differentiation and development; bone morphogenetic 2 protein, member of the transforming growth factor-beta (TGF-β) super family and determinant in neural development (White, 2001); DYRK1, a dual-specificity protein kinase involved in brain development (Becker, 1998); a small inducible cytokine A5 (SCYA5), the T cell activation increased late expression (TACTILE), and a myeloid cell nuclear differentiation antigen (MNDA). It is remarkable that this list includes potential diagnostic or therapeutic targets like the ERG oncogene (V-ETS Avian Erythroblastosis virus E26 oncogene related, found in AML patients), the phospholipase C-like protein 1 (PLCL, tumor suppressor gene), a cystein rich angiogenic inducer (CYR61), and the MYC, MYB oncogenes. Other genes in the list are located in critical regions mutated in leukemia, which suggests their connection with the leukemogenic process. Such genes include Selenoprotein P (SPP1, 5q), the protein kinase inhibitor p58 (DNAJC3 in 13q32), and the cyclin C (CCNC).
Discussion
Traditionally, infant leukemia has been classified according to a host of clinical parameters and biological features that tend to correlate with prognosis. This classification system has been used for risk-based classification assignment. However, unexplained variability in clinical courses still exists among some individuals within defined risk-group strata. Differences in the molecular constitution of malignant cells within subgroups may help to explain this variability.
In our initial profiling of 126 infant acute leukemia cases, we have used microarray technology to both segregate patient subgroups and to uncover genetic diversity among patients that fall within the same traditional risk groups. The results reported here identify three previously unrecognized groups of infant leukemia cases, driven by differential gene expression pattern and possibly related to three independent disease initiation mechanisms. Two of these clusters support previous data about leukemic etiology: environmental exposure and viral infections, both of which may occur in utero.
Our data also supports the existence of a third group, with a particular gene expression pattern suggestive of a novel stem cell neoplasia with leukemic behavior. The genes expressed in most of these cases resemble those present in the hematopoietic/angioblastic primordial cell (Young, 1995; Eichman, 1997); see for example,
Together, these results indicate the occurrence of, at least, three inherent biological subgroups of infant leukemia, not precisely defined by traditional AML vs. ALL or cytogenetics labels; probably driven by characteristics with potential clinical relevance. Consideration of these three categories may enable selection criteria for more powerful clinical trials, and might lead to improved treatments with better success rates.
Methods
To develop gene expression-based classification schemes related to the pathogenic basis underlying the leukemic process in infant acute leukemia, 126 patients registered to NCI-sponsored Infant Oncology Group/Children's Oncology Group treatment trials were examined using Affymetrix U95Av2 oligonucleotide microarrays containing 12,625 probes. Of the 126 cases, 78 were ALL (62%), 48 were AML (38%) and 56 (44%) cases had translocations involving the MLL gene (chromosome segment 11q23). An average of 2×107 cells were used for total RNA extraction with the Qiagen RNeasy mini kit (Valencia, Calif.). The yield and integrity of the purified total RNA were assessed with the RiboGreen assay (Molecular Probes, Eugene, Oreg.) and the RNA 6000 Nano Chip (Agilent Technologies, Palo Alto, Calif.), respectively. Complementary RNA (cRNA) target was prepared from 2.5 μg total RNA using two rounds of Reverse Transcription (RT) and In Vitro Transcription (IVT). Following denaturation for 5 minutes at 70° C., the total RNA was mixed with 100 pmol T7-(dT)24 oligonucleotide primer (Genset Oligos, La Jolla, Calif.) and allowed to anneal at 42° C. The mRNA was reverse transcribed with 200 units Superscript II (Invitrogen, Grand Island, N.Y.) for 1 hour at 42° C. After RT, 0.2 vol. 5× second strand buffer, additional dNTP, 40 units DNA polymerase I, 10 units DNA ligase, 2 units RnaseH (Invitrogen) were added and second strand cDNA synthesis was performed for 2 hours at 16° C. After T4 DNA polymerase (10 units), the mix was incubated an additional 10 minutes at 16° C. An equal volume of phenol:chloroform:isoamyl alcohol (25:24:1) (Sigma, St. Louis, Mo.) was used for enzyme removal. The aqueous phase was transferred to a microconcentrator (Microcon 50. Millipore, Bedford, Mass.) and washed/concentrated with 0.5 ml DEPC water twice the sample was concentrated to 10-2011. The cDNA was then transcribed with T7 RNA polymerase (Megascript, Ambion, Austin, Tex.) for 4 hours at 37° C. Following IVT, the sample was phenol:chloroform:isoamyl alcohol extracted, washed and concentrated to 10-20 μl. The first round product was used for a second round of amplification which utilized random hexamer and T7-(dT)24 oligonucleotide primers, Superscript II, two RNase H additions, DNA polymerase I plus T4 DNA polymerase finally and a biotin-labeling high yield T7 RNA polymerase kit (Enzo Diagnostics, Farmingdale, N.Y.). The biotin-labeled cRNA was purified on Qiagen RNeasy mini kit columns, eluted with 50 μl of 45° C. RNase-free water and quantified using the RiboGreen assay. Following quality check on Agilent Nano 900 Chips, 15 μg cRNA were fragmented following the Affymetrix protocol (Affymetrix, Santa Clara, Calif.). The fragmented RNA was then hybridized for 20 hours at 45° C. to HG_U95Av2 probes. The hybridized probe arrays were washed and stained with the EukGE_WS2 fluidics protocol (Affymetrix), including streptavidin phycoerythrin conjugate (SAPE, Molecular Probes, Eugene, Oreg.) and an antibody amplification step (Anti-streptavidin, biotinylated, Vector Labs, Burlingame, Calif.). HG_U95Av2 chips were scanned at 488 nm, as recommended by Affymetrix. The expression value of each gene was calculated using Affymetrix Microarray Suite 5.0 software.
Data Presentation and Exclusion Criteria
Some of the criteria used as quality controls include: total RNA integrity, cRNA quality, array image inspection, B2 oligo performance, and internal control genes (GAPDH value greater than 1800).
Data Analysis
Affymetrix MAS 5.0 statistical analysis software was used to process the raw microarray image data for a given sample into quantitative signal values and associated present, absent or marginal calls for each probeset. A filter was then applied which excluded from further analysis all Affymetrix “control” genes (probesets labeled with AFFY_prefix), as well as any probeset that did not have a “present” call at least in one of the samples. For this analysis our Bayesian classification and VxInsight clustering analysis omitted this step, choosing instead to assume minimal a priori gene selection (Helman et al, 2003; Davidson et al., 2001). The filtering step reduced the number of probe sets from 12,625 to 8,414, resulting in a matrix of 8,414×N signal values, where N is the number of cases. The first stage of our analysis consisted of a series of binary classification problems defined on the basis of clinical and biologic labels. The nominal class distinctions were ALL/AML, MLL/not-MLL, achieved complete remission CR/not-CR. Additionally, several derived classification problems-based on restrictions of the full cohort to particular subsets of data such as a VxInsight cluster-were considered (see main text). The multivariate unsupervised learning techniques used included Bayesian nets (Helman et al., 2003) and support vector machines (Guyon et al., 2002). The performance of the derived classification algorithms was evaluated using fold-dependent leave-one-out cross validation (LOOCV) techniques. These methods combined allowed the identification of genes associated with remission or treatment failure and with the presence or absence of translocations of the MLL gene across the dataset. In order to identify potential clusters and inherent biologic groups, a large number of clinical co-variables were correlated with the expression data using unsupervised clustering methods such as hierarchical clustering, principal component analysis and a force-directed clustering algorithm coupled with the VxInsight visualization tool. Agglomerative hierarchical clustering with average linkage (similar to Eisen et al., 1998) was performed with respect to both genes and samples, using the MATLAB (The Mathworks, Inc.), the MatArray toolbox and native MATLAB statistics toolbox. The data for a given gene was first normalized by subtracting the mean expression value computed across all patients, and dividing by the standard deviation across all patients for each gene. The distance metric used was one minus Pearson's correlation coefficient; this choice enabled subsequent direct comparison with the VxInsight cluster analysis, which is based on the t-statistic transformation of the correlation coefficient (Davidson et al., 2001). The second clustering method was a particle-based algorithm implemented within the VxInsight knowledge visualization tool (www.sandia.gov/projectsJVxInsight.html). In this approach, a matrix of pair similarities is first computed for all combinations of patient samples. The pair similarities are given by the t-statistic transformation of the correlation coefficient determined from the normalized expression signatures of the samples (Davidson et al., 2001). The program then randomly assigns patient samples to locations (vertices) on a 2D graph, and draws lines (edges), thus linking each sample pair, and assigning each edge a weight corresponding to the pairwise t-statistic of the correlation. The resulting 2D graph constitutes a candidate clustering. To determine the optimal clustering, an iterative annealing procedure is followed, wherein a ‘potential energy’ function that depends on edge distances and weights is minimized, following random moves of the vertices (Davidson et al., 1998, 2001). Once the 2D graph has converged to a minimum energy configuration, the clustering defined by the graph is visualized as a 3D terrain map, where the vertical axis corresponds to the density of samples located in a given 2D region. The resulting clusters are robust with respect to random starting points and to the addition of noise to the similarity matrix, evaluated through its effect on neighbor stability histograms (Davidson et al., 2001).
Young, P. E., Baumhueter, S. and Laskiy, L. A. The sialomucin CD34 is expressed on hematopoietic cells and blood vessels during murine development. Blood, 85, 96-105 (1995).
Table 44. Class Predictor Performance In order to optimize gene selection and determine the success rate of each classifier, fold-dependent leave-one-out cross-validation was used on the training set (n = 82) followed by “single shot” prediction on our validation set (n = 44) using the trained classifiers.
r = Success rate;
p-value1 = Computed using the first method as described in Supplemental Information;
p-value2 = Computed using the second method as described in Supplemental Information.
*means that the predictor is significant at level α = 0.05
**means that the predictor is significant at level α = 0.01.
— indicates that the Fisher's exact test can not be fulfilled because two cells in the contingency table are zero.
Table 46. Overall success rates of class predictors after including the A, B and C cluster predictions.
r = Estimate of the success rate of the class predictor,
C.I. = 95% confidence interval of the success rate of the class predictor,
p-value = p-value of hypothesis test (see Supplemental Information).
*means that r > 0.5 at significance level α = 0.05.
**means that r > 0.5 at significance level α = 0.01.
Supplementary Information
Sample Management
Cell suspensions from diagnostic bone marrow aspirates or peripheral blood samples were handled according to the cryopreservation procedure of the St. Jude's Children's Hospital. Samples were retrieved from cryopreservation at −135° C. and thawed quickly at 37° C. and then washed by centrifugation at 1200 rpm for 5 minutes in warmed 20% (v/v) Fetal Bovine Serum in Dulbecco's Modified Minimum Essential Medium (Invitrogen, Grand Island, N.Y.). Cytospins were prepared from thawed samples, stained with Wright's stain and assessed for percent blasts and cell viability by light microscopy. Decanted cell pellets were used immediately for RNA purification.
RNA Extraction and T7 Amplification
An average of 2×107 cells were used for the total RNA extraction with the Qiagen RNeasy mini kit (VWR International AB, Stockolm, Sweden). The mean of the purified total RNA concentration was 0.5 μg/ul (approximately 25 μg of total RNA yield), as quantified with the RiboGreen assay (Molecular Probes, Eugene, Oreg.). All samples met assay quality standards as recommended by Affymetrix. The A260 nm/A280 nm ratio was determined spectrophotometrically in 10 mM Tris, pH 8.0, 1 mM EDTA, and all samples used for array analysis exceeded values of 1.8. The RNA integrity was analyzed by electrophoresis using the RNA 6000 Nano Assay run in the Lab-on-a Chip (Agilent Technologies, Palo Alto, Calif.). High quality RNA quality criteria included a 28S rRNA/18S rRNA peak area ratio>1.5 and the absence of DNA contamination. To prepare cRNA target, the mRNA was reverse transcribed into cDNA, followed by re-transcription in a method that uses two rounds of amplification devised for small starting RNA samples, kindly provided by Ihor Lemischka (Princeton University), with the following modifications: linear acrylamide (10 ug/ml, Ambion, Austin, Tex.) was used as a co-precipitant in steps that used alcohol precipitation and the starting amount of RNA was 2.5 ug of total RNA. Briefly, a T7-(dT)24 oligonucleotide primer (Genset Oligos, La Jolla, Calif.) was annealed to 2.5 ug of total RNA and reverse transcribed with Superscript II (Invitrogen, Grand Island, N.Y.) at 42° C. for 60 min. Second strand cDNA synthesis by DNA polymerase I (Invitrogen) at 16° C. for 120 min was followed by extraction with phenol:chloroform:isoamyl alcohol (25:24:1)(Sigma, St. Louis, Mo.) and microconcentration (Microcon 50. Millipore, Bedford, Mass.). RNA was then transcribed from the cDNA with a high yield T7 RNA polymerase kit (Megascript, Ambion, Austin, Tex.). The second round of amplification utilized random hexamer and T7-(dT)24 oligonucleotide primers, Superscript II, DNA polymerase I and a biotin labeling high yield T7 RNA polymerase kit (Enzo Diagnostics, Farmingdale, N.Y.). The biotin-labeled cRNA was purified on RNeasy mini kit columns, eluted with 50 ul of 45° C. RNase-free water and quantified using the RiboGreen assay.
Target Labeling and Probe Hybridization
Following quality check on Agilent Lab-on-a-Chip, 15 ug cRNA were fragmented for 35 minutes in 200 mM Tris-acetate pH 8.1, 150 mM MgOAc and 500 mM KOAc following the Affymetrix protocol (Affymetrix, Santa Clara, Calif.). The fragmented RNA was then hybridized for 20 hours at 45° C. to HG_U95Av2 probes. The hybridized probe arrays were washed and stained with the EukGE-WS2 fluidics protocol (Affymetrix), including streptavidin phycoerythrin conjugate (SAPE, Molecular Probes, Eugene, Oreg.) and an antibody amplification step (Anti-streptavidin, biotinylated, Vector Labs, Burlingame, Calif.). HG_U95Av2 chips were scanned at 488 nm, as recommended by Affymetrix. The images were inspected to detect artifacts. The expression value of each gene was calculated using Affymetrix GENECHIP software for the 12,625 Open Reading Frames on the probe set.
Data Presentation and Exclusion Criteria
Criteria used as quality control for exclusion of poor sample arrays included: total RNA integrity, cRNA quality, probe array image inspection, B2 oligo staining (used for Array grid alignment), and internal control genes (GAPDH value greater than 1800). Of the 142 cases initially selected, 126 were ultimately retained in the study; 16 cases were excluded from the final analysis due to poor quality total RNA or cRNA amplification or a poor hybridization (low percentage of expressed genes<10%, poor 3′/5′ amplification ratios).
Data Analysis
1. Data Preprocessing
The preprocessing stage was divided in filtering and transformation. For filtering, the control probesets were removed (i.e. probesets whose accession ID starts with the AFFX prefix), as well as all probesets that had at least one “absent” call (as determined by the Affymetrix MAS 5.0 statistical software) across all training set samples. In the transformation stage, the natural logarithm of the gene expression values (i.e. the signal values) was taken. This is the preprocessing method used for most of the analysis methods; except those in which different preprocessing is mentioned in the detailed information below.
2. Description of the supervised learning methods for class prediction The exploratory evaluation of our data set was performed in several steps. The first step was the construction of predictive classification algorithms that linked gene expression data to patient outcome as well as the traditional clinical variables that define prognosis. With previous knowledge of their sample nature, the 126 patients were divided into statistically balanced and representative training (82 patients) and test sets (44 patients), according to the clinical labels (leukemia lineage, cytogenetics and outcome). For classification purposes, several primary supervised approaches were used, including Bayesian networks, recursive feature elimination in the context of Support Vector Machines (SVM-RFE), linear discriminant analysis and fuzzy logics. Classification tasks were as follows:
2.1. Bayesian Networks
We employed the Bayesian network framework described in (6), without any data preprocessing. The Bayesian network modeling and learning paradigm was introduced in Pearl (1988) and Heckerman et al. (1995), (7, 8) and has been studied extensively in the statistical machine learning literature. Our work tailors this paradigm to the analysis of gene expression data in general and to the classification problem in particular. A Bayesian net is a graph-based model for representing probabilistic relationships between random variables. The random variables, which may, for example, represent gene expression levels, are modeled as graph nodes; probabilistic relationships are captured by directed edges between the nodes and conditional probability distributions associated with the nodes. A Bayesian net asserts that each node is statistically independent of all its no descendants, once the values of its parents (immediate ancestors) in the graph are known. That is, a node n's parents render n and its no descendants conditionally independent. In our modeling, we consider Bayesian nets in which each gene is a node, and the class label of interest is an additional node C having no children. The conditional independence assertion associated with (leaf) node C implies that the classification of a case q depends only on the expression levels of the genes, which are C's parents in the net. More formally, distribution Pr{q[C]\q[genes]} is identical to distribution Pr{q[C]\q[Par(C)]}, where Par(C) denotes the parent set of C. Note, in particular, that the classification does not depend on other aspects (other than the parent set of C) of the graph structure of the Bayesian net. Thus, while the Bayesian network model ultimately can be a highly appropriate tool for learning global gene regulatory networks, in the context of classification tasks such as those considered in this paper, the Bayesian network learning problem may be reduced to the problem of learning subnetworks consisting only of the class label and its parents. It is important to emphasize how this modeling differs from that of a naïve Bayesian classifier (9, 10) and from the generalization described in (11). A naive Bayesian classifier assumes independence of the attributes (genes), given the value of the class label. Under this assumption, the conditional probability Pr{q[C]\q[genes]} can be computed from the product Πgiεgenes Pr{q[gi]\q[C]} of the marginal conditional probabilities. The naive Bayesian model is equivalent to a Bayesian net in which no edges exist between the genes, and in which an edge exists between every gene and the class labels. We make neither assumption. Rather, we ignore the issue of what edges may exist between the genes, and compute Pr{q[C]\q[genes]} as Pr{q[C]\q[Par(C)]}, an equivalence that is valid regardless of what edges exist between the genes, provided only that Par(C) is a set of genes sufficient to render the class label conditionally independent of the remaining genes. Friedman et al. (1997) (11) drops the independence assumption of a naive Bayesian classifier and attempts to learn edges between the attributes (genes, in our context), while maintaining an edge from the class label into each attribute. This approach yields good improvements over naive Bayesian classifiers in the experiments (application domains other than gene expression data) reported in Friedman et al. (1997) (11). Our approach exploits a prior belief (supported by experimental results reported in (6) and in other gene expression analyses) that for the gene expression application domain, only a small number of genes is necessary to render the class label (practically) conditionally independent of the remaining genes. This both makes learning parent sets Par(C) tractable, and generally allows the quantity Pr{q[C]\q[Par(C)]} to be well estimated from a training sample. Even with the focus on restricted subnetworks, the learning problem is enormously difficult. Given a collection of training cases, we must learn one or more “plausible” Bayesian subnetworks, each consisting of class label node C and its parent set Par(C). The main factors contributing to the difficulty of this learning problem are the large number genes, the fact that the expression values of the genes are continuous, and the fact that expression data generally is rather noisy. The approach to Bayesian network learning employed here identifies parent sets which are supported by current evidence by employing an external gene selection algorithm which produces between 20 and 30 genes using a measure of class separation quality similar to the TNoM score described in (12, 13). A binary binning of each selected gene's expression value about a point of maximal class separation also is performed. The set of selected genes then is searched exhaustively for parent sets of size 5 or less, with the induced candidate networks being evaluated by the BD scoring metric (8). This metric, along with a variance factor, is used to blend the predictions made by the 500 best scoring networks (6). Each of these 500 Bayesian networks can be viewed as a competing hypothesis for explaining the current evidence (i.e., training data and simple priors) for the corresponding classification task, and the gene interactions each suggests are potentially of independent interest as well. Another significant aspect of our method involves a distinct normalization of the gene expression data for each classification task. We have found this a necessary follow-up step to the standard Affymetrix scaling algorithm. Our approach to normalization is to consider, for each case, the average expression value over some designated set of genes, and to scale each case so that this average value is the same for all cases. This approach allows the analysis to concentrate on relative gene expression values within a case by standardizing a reference point between cases. The designated reference genes for a given classification task are selected based on poorest class separation quality, which is a heuristic for identifying reference genes likely to be independent of the class label.
2.2 Support Vector Machines
Support vector machines (SVMs) are powerful tools for data classification (14, 15, 16). The development of the SVM was motivated, in the simple case of two linearly separable classes, by the desire to choose an optimal linear classifier out of an infinite number of linear classifiers that can separate the data. This optimal classifier corresponds not only to a hyperplane that separates the classes but also to a hyperplane that attempts to be as far away as possible from all data points. If one imagines inserting the widest possible corridor between data points (with data points belonging to one class on one side of the corridor and data points belonging to the other class on the other side), then the optimal hyperplane would correspond to the imaginary line/plane/hyperplane running through the middle of this corridor.
The SVM has a number of characteristics that make it particularly appealing within the context of gene selection and the classification of gene expression data, namely:
In order to be computationally feasible, other methods first have to reduce the number of dimensions (features/genes), and then classify the data in the reduced space. A univariate feature selection process or filter ranks genes according to how well each gene individually classifies the data (13,17). The overall SVM classification is then heavily dependent upon how successful the univariate feature selection process is in pruning genes that have little class-distinction information content. In contrast, the SVM provides an effective mechanism for both classification and feature selection via the Recursive Feature Elimination algorithm (18). This is a great advantage in gene expression problems where d is much greater than N because the number of features does not have to be reduced a priori.
Recursive Feature Elimination (RFE) is an SVM-based iterative procedure that generates a nested sequence of gene subsets whereby the subset obtained at iteration k+1 is contained in the subset obtained at iteration k. The genes that are kept per iteration correspond to genes that have the largest weight magnitudes—the rationale being that genes with large weight magnitudes carry more information with respect to class discrimination than those genes with small weight magnitudes.
Implementation of RFE algorithm: The rate of reduction in the number of genes via the RFE algorithm typically been geometric in nature (18,19). For example, in (18), 50% of the genes were removed per RFE iteration. However, as in (19), we have taken a less aggressive pruning approach with respect to the number of genes being removed per RFE iteration. In this work, the number of genes removed was constant within blocks of intervals: from 8000 to 1000 genes, 1000 genes were removed per iteration; from 900 to 200 genes, 100 genes were removed per iteration, etc.
Leave-one-out cross-validation (LOOCV) was used to assess the performance of a linear SVM classifier. The LOOCV procedure divides the training samples into N disjoint sets where the ith set contains samples 1, . . . , i−1, i+1, . . . , N. The SVM classifier is then trained on the ith set and tested on the withheld ith sample. This process is repeated for each set and the LOOCV error is the overall number of misclassifications divided by N. Note that the RFE algorithm was performed separately on each leave-one-out fold—failure to do induces a selection bias that yields LOOCV error rates that are overly optimistic (20). If the benchmark for determining the number of genes to use in training the SVM classifier is based only upon RFE iterations with low LOOCV error, then one finds in practice many sets of gene numbers (e.g. 500, 100 or 50 genes) that satisfy this criterion. Using only the training set LOOCV error, there is no obvious way to choose which number of genes should be used a priori on the test set. Indeed, classifiers using different numbers of genes will often lead to inconsistent predictions on the test set.
Instead of choosing one subset of genes out of many as the definitive gene subset to be used on the test set, we instead use many subsets in a weighted voting scheme fashion. The gene subsets used corresponds to those sets with low LOOCV error. To determine the weight attributed to each subset of genes, metrics of classifier assessment other than LOOCV error were used. Once LOOCV has been performed, the SVM classifier is then retrained on the entire training set.
Let G={G1, . . . , Gr} denote the collection of gene subsets with low LOOCV error, where r is the number of gene subsets. The number of gene subsets, r, used in this study was determined by inspection. However, one can easily use LOOCV as a mechanism for determining r. Let fi(pj) denote the prediction of the ith set, Gi, for the jth patient, pj, in the test set. The final prediction for the jth patient, f(pj), consists of a linear combination of the predictions made by each set:
where αi is the weight attributed to each gene subset. In this work, αi is determined solely from the training set and consists of two components:
The SVM and RFE algorithms were written in MATLAB (21). The particular SVM algorithm used was based upon the Lagrangian SVM formulation of Mangasarian and Musicant (22). The RFE approach with the voting scheme extension achieved the highest test set accuracy on the majority of the tasks examined in this work. The best test accuracy was achieved for the AML/ALL classification task while the performance on the other tasks were slightly better than the “majority-class” results—the results obtained if one were to always vote with the majority class. This is not surprising since the AML/ALL class distinctions tend to “dominate” the gene expression behavior. Since SVMs are not dependent upon an a priori and external feature/gene reduction procedure and can efficiently fold feature selection into the classification process, they will continue to perform well on tasks where the class distinctions dominate the gene expression behavior. Non-linear SVMs were trained on several of the classification tasks, but their generalization performance on the test set, as expected, was far worse than the linear SVM classifiers. Since the patients already sparsely populate a very high-dimensional gene space, mapping to even higher-dimensional feature space via a nonlinear kernel will only exacerbate the dilemma of over fitting, a condition already made worse due to the disturbingly small size of the training set relative to the number of genes and the large amount of experimental noise associated with microarray-generated data in general.
2.3 Class Prediction by Linear Discriminant Analysis
Discriminant analysis is a widely used statistical analysis tool (23). It can be applied to classification problems where a training set of samples, depending on some set of feature variables, is available. The idea is to find a linear or non-linear function of the feature variables such that the value of the function differs significantly between different classes. The function is the so-called discriminant function. Once the discriminant function has been determined using the training set, we can predict the class that a new sample most likely belongs to.
Preprocessing: Not all of the original data ware used in our analysis of the infant leukemia dataset. We eliminated all control genes (those with accession ID starting with the AFFX prefix) and those genes with all calls ‘Absent’ for all 142 samples. With these genes removed from the original 12625, we were left with 8414 genes. In addition, a natural log transformation was performed on 8414×142 matrix of the gene expression values prior to further analysis.
Selection of Significant Discriminating Genes for Binary Classifications: We assumed that the discriminating genes will be those with the most statistically significant difference between the two classes in a given binary classification task. We evaluated each gene by checking if its expression value differed significantly between the two classes. This was done using the two-sample t-test. The larger the absolute value of the t-test statistic T, the greater the confidence that there is a difference between the expression values of the two classes. The significance of the difference can be measured via the corresponding p-value, which provides a straightforward means of ranking the genes in order of importance.
Class Prediction: Once the genes have been ranked using the p-value, we need to select a subset as our discriminant variables. The expression values of these genes in the training set are used to determine a linear discriminant function, which discriminates between the two classes and also defines a trained classifier for making the class predictions for each sample in the test set. The question is how to determine the optimal value for n. n must be less than the sample size of the training set, otherwise the covariance matrix of the samples in the training set will be singular and the discriminant function cannot be determined. Also, if n is too large the discriminant function may be over fitted to the data in the training set, which may lead to more misclassifications when it is used to make predictions in test set. On the other hand, if n is too small, then the information contained in the feature set may be not sufficient for making accurate predictions. In practice, different prediction outcomes result when different numbers n of prediction genes are used in the classifier. To determine the class of a given sample from the test set, we have therefore we have chosen to use a simple voting scheme. We make a series of predictions with the number n of prediction genes varying from ⅓ to ⅔ of the sample size of the training set. (For example, if the number of samples in the training set was 85, we computed predictions for the given sample from the test set using n=28, 29, 30, . . . , 56.) The dominant class predicted is then taken as the final prediction result for the sample. Overall, the results of our discriminant analysis for classification tasks were not as good as those of the other multivariate methods (fuzzy logic, Bayesian, SVM) applied to these problems.
2.4 Fuzzy Interference Classification Methodology
Traditional classification methods are based on the theory of crisp sets, where an element is either a member of a particular set or not. However many objects encountered in the real world do not fall into precisely defined membership criteria. Alternative forms of data classification, which allows for continuous membership gradations, have been investigated and introduced fuzzy logic theory (24).
In many applications, it is easier to produce a linguistic description of a system than a complex mathematical model. The advantage of fuzzy logic in these situations is its ability to describe systems linguistically through rule statements (25). Expert human knowledge can then be formulated in a systematic manner. For example, for a gene regulatory model, one rule statement might be: “If the activator A is high and the repressor B is low, then the target C would be high” (26).
A Fuzzy Inference System (FIS) contains four components: fuzzy rules, a fuzzifier, an inference engine, and a “defuzzifier” (27). The fuzzy rules, consisting of a collection of IF-THEN rules, define the behavior of the inference engine. The membership functions μF(x) provide measure of the degree of similarity of elements to the fuzzy subset.
In fuzzy classification, the training algorithm adapts the fuzzy rules and membership functions so that the behavior of the inference engine represents the sample data sets. The most widely used adaptive fuzzy approach is the neuro-fuzzy technique, in which learning algorithms developed for neural nets are modified so that they can also train a fuzzy logic system (28).
Preprocessing. The infant dataset we used consists of gene expression level for 12625 probesets on the Affymetrix U95Av2 chip, including 67 control genes, measured for 142 patients. The Affymetrix Microarray Suite (MAS) 5.0 assigns a “Present”, “Marginal”, or “Absent” call to the computed signal reported for each probeset [Affymetrix 2001]. Because of strong observed variations in the range of gene expression values across different experiments, it is necessary to preprocess the data prior to further analysis.
In the infant dataset, 17% of all the labels are “Present”, 81% are “Marginal”, and 2% are “Absent”. We prefer not to eliminate too many probesets at the outset. So we choose a loose rule to filter the probesets. We assume that “reliable probesets” satisfy the following criteria:
In any binary classification task, there are four possible prediction outcomes characterized as true-positive (TP), false-positive (FP), true-negative (TN) and false-negative (FN). In the former two instances, a sample is, respectively, correctly or incorrectly classified into Class A, while the latter two instances correspond to classification into Not-Class A. Consequently, the performance of a class predictor can always be completely summarized in terms of a 2×2 matrix as shown in Table 48.
Note that because each row sums to 1 only one quantity is required from each row in order to determine the entire matrix. In other words, there are only two independent quantities in Table 48. These can be regarded as evaluating the different aspects of the class predictor's performance. Improving a class predictor's performance in TP may lower its TN, while its TN may be improved at the cost of reducing of its TP. In order to evaluate the overall performance of a class predictor, therefore, a measure that combines the two independent quantities is needed.
We considered two such overall measures: the success rate r, and the odds ratio OR. The success rate is defined as the probability of correct prediction. This is just a weighted average of TP and TN:
r=w1TP+w2TN, [1]
where w1=actual proportion of Class A in the test set, and w2=1−w1. TP and TN are intrinsic values associated with a given predictor, and are unknown; therefore r is also unknown and must be estimated. A commonly used point estimate of r, which we have utilized here, is the ratio of the number of correct predictions to the total number of predictions. We have also computed the 95% confidence intervals of r (35). Finally, we have performed a significance test to evaluate the extent to which the performance of a predictor differs from what would have been obtained by chance alone. This is equivalent to testing the statistical hypotheses
H0:r=0.5 verses HA:r>0.5. [2]
If the p-value (35) of the test is no larger than a given significance level α (here, we have set α=0.05 and α=0.01), then we reject the null hypothesis H0 and conclude that the difference is significant at level α. The p-value is closely related to the success rate: the larger the success rate, the smaller the expected p-value. Thus, either success rate or the p-value can be used to measure the performance of a predictor. For each of four class predictors, and with respect to each of thirteen tasks, we have computed the point estimate and confidence interval of r. These are presented in Table 48, along with the p-value corresponding to the statistical test of hypotheses [2].
The second overall measure that we utilized is the odds ratio (OR). Since a good class predictor should simultaneously satisfy
TP>FN and FP<TN, [3]
or equivalently,
TP/FN>1 and FP/TN<1, [4]
this implies that the ratio of the right hand sides of the inequalities in [4], i.e.,
should be large (at least larger than 1). Hence this ratio—known as the odds ratio (29)—can be utilized as an overall measure for evaluating the class predictor's performance. For each of the four class predictors and each of the thirteen tasks, the estimated value of OR and its 95% exact confidence interval (36) have been calculated through the use of SAS package (37), and the results are listed in Table 49.
Above, we observed that the expected values for the TP and FP of a good class predictor should satisfy TP>FP or TP/FP>1, which is mathematically equivalent to OR>1. This suggests that the performance of a classifier can alternatively be evaluated by testing the following hypotheses:
H0:TP<FP vs. HA:TP>FP, [6]
or equivalently
H0: OR<1 vs. HA:OR>1. [7]
Hence the p-value of the test also serves as a good measure for evaluating the performance of the class predictor. An uniformly most powerful unbiased test—known as Fisher's exact test (38)—has been used to test the hypotheses [7] and the p-values of the test are given in Table 49.
From Tables 48 and 49 it is evident that all of the four class predictors performed well on Tasks 1 and 3. The statistical test for hypotheses [2] rejects the null hypothesis H0 and we may conclude that the predictions made by the four class predictors on these tasks are significantly better than those made by chance, at level α=0.01. Fisher's exact test yields the similar results, except that for two of the predictors (fuzzy inference and discriminant analysis), the significance level for Task 3 predictions is α=0.05.
KEY:
r = Estimate of the success rate of the class predictor.
C.I. = 95% confidence interval of the success rate of the class predictor.
p-value = p-value of hypothesis test [2] (see text).
*means that r > 0.5 at significance level α = 0.05.
**means that r > 0.5 at significance level α = 0.01.
KEY:
OR = Estimate of the odds ratio.
C.I. = 95% confidence interval of the odds ratio.
p-value = p-value of Fisher's exact test.
*means that OR > 1 at significance level α = 0.05.
**means that OR > 1 at significance level α = 0.01.
4. Unsupervised Methods—Clustering Methodology
Three types of methodologies were used in the clustering analysis, namely agglomerative hierarchical clustering, Principal Component Analysis and a force-directed clustering algorithm coupled with the VxInsight visualization tool.
4.1 Agglomerative Hierarchical Clustering
The grouping together, or clustering, of genes with similar patterns of expression is based on the mathematical measure of their similarity, e.g. the Euclidian distance, angle or dot products of the two n-dimensional vectors of a series of n measurements. Biological interpretation of DNA microarray hybridization gene expression data has utilized clustering to re-order genes, and conversely samples into groups which reflect inherent biological similarity. Clustering methods can be divided into two classes, supervised and unsupervised. In supervised clustering vectors are classified with respect to known reference vectors. Unsupervised clustering uses no defined vectors. With a diverse dataset of 126 infant leukemia patients and our intent to discover unique patterns within, we chose to use an unsupervised clustering approach. In addition, combining the ordered list of genes and patients with a graphical presentation of each data point using relative value-color, termed a “heat map”, aids the viewer in an intuitive manner. Several computer software programs allow one to cluster significant samples and genes and create graphical output (Cluster, Genespring, GeneCluster).
We have applied the Eisen (39) Cluster algorithm utilizing pair wise average-linkage cluster analysis to gene expression data from Affymetrix U95Av2 arrays. Genes were selected for this analysis if the Affymetrix Microarray Analysis Software v. 5.0 predicted at least 1 of 126 patient data were “Present”. The resulting 8,358 genes were z-scored across patients and the standard deviation determined. The clustering algorithm of genes is as follows: the distance between two genes is defined as 1−r where r is the correlation coefficient between the 252 values of the two genes across samples. Two genes with the closest distance are first merged into a super-gene and connected by branches with length representing their distance, and are deleted from future merging. The expression level of the newly formed super-gene is the average of standardized expression levels of the two genes (average-linked) across samples. Then the next super-gene with the smallest distance is chosen to merge and the process repeated 8,352 times to merge all 8,353 genes.
4.2 Principal Component Analysis
Principal component analysis (PCA) is a well-known and convenient method for performing unsupervised clustering of high-dimensional data. Closely related to the Singular Value Decomposition (SVD), PCA is an unsupervised data analysis technique whereby the most variance is captured in the least number of coordinates (40-42). It can serve to reduce the dimensionality of the data while also providing significant noise reduction. PCA can also be applied to gene-expression data obtained from microarray experiments. When gene expressions are available from a large number of genes and from numerous samples, then the noise suppression and dimension reduction properties of PCA can greatly facilitate and simplify the examination and interpretation of the data. In any microarray experiment, the expression profiles of many genes are monitored simultaneously. Because many genes are often up or down regulated in similar patterns in the cells, these responses are correlated. PCA can identify the uncorrelated or independent sources of variation in the gene expression data from multiple samples. Since random noise tends to be uncorrelated with the signal, PCA does an effective job at separating the signal from the noise in the data.
If the gene expression values from each microarray are written as row vectors, then the entire data set from multiple microarray samples can be represented by a data matrix whose rows represent the gene expressions from each microarray chip. PCA can greatly reduce the complexity and dimensionality of the data by factor analyzing the data matrix into the product of two much smaller matrices. The two smaller matrices are known as scores and loading vectors (or eigenvectors). The decomposition is often achieved with a method known as singular value decomposition (SVD). PCA has the unique property that the decomposition is performed such that the rows of the score matrix are orthogonal and the columns of the eigenvector matrix are also orthogonal. Although there is a strict mathematical definition of orthogonal, orthogonal vectors are simply independent and uncorrelated with one another. Therefore, these vectors represent unique sources of variation in the microarray data. Another property of the eigenvectors is that they are calculated such that the first eigenvector represents the largest source of variance in the data, the second represents the next largest unique source of variance in the data, and so on. Since we generally expect the signal in the data to be larger than the noise and since random noise is approximately orthogonal to the signal, PCA has the ability to separate the noise from signal that we are interested in. By ignoring the eigenvectors with low variance, we can observe the portion of the data that contains primarily signal.
The scores matrix represents the amounts of each eigenvector in each sample that are required to reproduce the data matrix. When we eliminate the noisier eigenvectors we also eliminate their associated scores. The scores represent a compressed form of the data matrix in the new coordinate system of the eigenvectors. Since scores are derived from the expression of many genes and many samples, they have much higher signal-to-noise ratios than the individual gene expressions upon which they are based. A plot of the scores for each microarray for each eigenvector then is a new compressed form of the gene expression data for all samples. 2D plots of one set of scores vs. another for two selected eigenvectors allow us an examination of the microarray data in the compressed PCA space so that we can readily observe clusters in expression data. 3D plots are also possible when the scores from three selected eigenvectors are displayed. Statistical metrics can be used to identify groupings or clusters in the data in 2, 3, or higher dimensions that cannot be readily viewed graphically. All the statistical supervised and unsupervised clustering methods that are based on individual genes or groups of genes can be applied to the scores representation of the data.
The first three Principal Components partition the infant cohort into two different groups. Interestingly, these groups display a weak correlation with the infant ALL/AML lineage membership (and none with the MLL cytogenetics), although the correlation is not seen until the second PC. This indicates, according to the theory behind PCA, that the ALL/AML distinction is not the driving force behind the representation of the patient cohort. The first (and most important) Principal Component, on the other hand, does not reveal any obvious clusters. Upon further analysis, however, we did find an additional interesting group correlated with the first Principal Component. This group was discovered by a force-directed graph layout algorithm and the VxInsight® visualization program (43, 44).
4.3 VxInsight and the Force Directed Clustering Algorithm
This clustering algorithm places genes into clusters such that the sum of two opposing forces is minimized. One of these forces is repulsive and pushes pairs of genes away from each other as a function of the density of genes in the local area. The other force pulls pairs of similar genes together based on their degree of similarity. The clustering algorithm stops when these forces are in equilibrium. Every gene has some correlation with every other gene; however, most of these are not strong correlations and may only reflect random fluctuations. By using only the top few genes most similar to a particular gene as it is placed into a cluster we obtain two benefits. First, the algorithm runs much faster. Second, as the number of similar genes is reduced, the average influence of the other, mostly uncorrelated genes diminishes. This change allows the formation of clusters even when the signals are quite weak. However, when too few genes are used in the process, the clusters break up into tiny random islands, so selecting this parameter is an iterative process. One trades off confidence in the reliability of the cluster against refinement into sub-clusters that may suggest biologically important hypotheses. These clusters are only interpreted as suggestions, and require further laboratory and literature work before we assign them any biological importance. However, without accepting this trade off, it may be impossible to uncover any suggestive structure in the collected data. For example, we clustered using the twenty other genes most strongly similar to each gene. When we re-cluster using only the top ten most strongly similar genes, the observed clusters have broken up into smaller groups. We carefully analyzed these for biological support and believe that they may be suggestive of weak, but important groupings in our experimental data. VxInsight was employed to identify clusters of patients with similar gene expression patterns, and then to identify which genes strongly contributed to the separations. That process created lists of genes, which when combined with public databases and research experience, suggest possible biological significances for those clusters. The array expression data were clustered by rows (similar genes clustered together), and by columns (patients with similar gene expression clustered together). In both cases Pearson's R was used to estimate the similarities. These similarities were used together with a force-directed, two-dimensional clustering algorithm (43, 44) to produce maps showing clusters of genes and patients. Different maps were generated by using the top twenty, top ten and top five strongest correlations for each gene (using more similarity links between genes generates more stable clusters, while using fewer links leads to finer, if less stable, divisions). This methodology has been useful in inferring functions of uncharacterized genes clustered near other genes with known functions (45, 46), and did contribute to our analysis here, too. However, patients were the main focus of this study and most of the analysis revolved around the map of patient clusters. Analysis of variance (ANOVA) was used to determine which genes had the strongest differences between pairs of patient clusters. These gene lists were sorted into decreasing order based on the resulting F-scores, and were presented in an HTML format with links to the associated OMIM pages, which were manually examined to hypothesize biological differences between the clusters.
We also investigated the stability of those gene lists using statistical bootstraps (47, 48). For each pair of clusters we computed 1000 random bootstrap cases (resampling with replacement from the observed expressions) and computed the resulting ordered lists of genes using the same ANOVA method as before. The average order in the set of bootstrapped gene lists was computed for all genes, and reported as an indication of rank order stability (the percentile from the bootstraps estimates a p-value for observing a gene at or above the list order observed using the original experimental values). Because the force directed placement algorithm used by VxInsight has a stochastic element (random initial starting conditions), we used massively parallel computers to calculate hundreds of reclustering with different seeds for the random number generator. We compared pairs of ordinations by counting, for every gene, the number of common neighbors found in each ordination. Typically, we looked in a region containing the 20 nearest neighbors around each gene, in which case one could find (around each gene) a minimum of 0 common neighbors in the two ordinations, or a maximum of 20 common neighbors. By summing across every one of the genes an overall comparison of similarity of the two ordinations can be computed. We computed all pair wise comparisons between the randomly restarted ordinations and found the ordination that had the largest count of similar neighbors across the totality of all the comparisons. Note that this corresponds to finding the ordination whose comparison with all the others has minimal entropy, and in a general sense represents the most central ordination (MCO) of the entire set. It is possible to use these comparison counts (or entropies) as similarity measures to compute another round of ordinations. The clusters from this recursive use of the ordination algorithm are generally smaller, much tighter, and are generally more stable with respect to random starting conditions than any single ordination. We used all of these methods during exploratory data analysis to develop intuition about the data.
5. Lists of Informative Genes
Homo sapiens cDNA: clone HEP03585
Homo sapiens mRNA for TL132
Homo sapiens, clone IMAGE: 4391536, mRNA
Homo sapiens aconitase precursor (ACON) mRNA,
Homo sapiens cDNA FLJ32313 fis, clone PROST
Homo sapiens low density lipoprotein receptor
Homo sapiens mRNA full length insert cDNA
Homo sapiens clone 23579 mRNA sequence
Homo sapiens aconitase precursor (ACON)
Homo sapiens mRNA full length insert cDNA clone
Homo sapiens mRNA; cDNA DKFZp586J101
6. Additional Explorations on VxInsight Clustering Results with the Genetic Algorithm K-Nearest Neighbor Method (GA/KNN).
As it was previously mentioned, the VxInsight clustering algorithm identified three major groups, A, B, and C, in the infant leukemia dataset. We hypothesized these groups correspond to distinct biologic clusters, correlated with unique disease etiologies. Several approaches were used to evaluate cluster stability and to determine genes that discriminate between the clusters. In order to test how well these three clusters can be distinguished using supervised classification and cross-validation methods (49) we used a genetic algorithm training methodology to perform feature selection using a simple K-nearest neighbor classifier (50, 51). This approach was applied using VxInsight cluster train/test class labels, creating three implied one-vs.-all classification problems (A vs. B+C, etc.) The “top 50” discriminating gene lists are reported for each problem, and compared with previously obtained ANOVA gene lists.
To compare this “top 50” gene lists with the gene lists generated using ANOVA, we used a one-vs-all-others (OVA) approach to form three binary classification problems: a) A vs. BC; b) B vs. CA; c) C vs. AB. Based on our subsequent numerical results (time to solution for the genetic algorithm), Task (a) appears to have been the easiest and Task (b) the hardest. We also did three-way classification for VxInsight groups. It is Task (d).
6.1. GA/KNN Procedure and Parallel Program Parameters
The Genetic Algorithm (GA) K Nearest Neighbor (KNN) method (50, 51) is a supervised feature selection method based on the non-parametric k-nearest neighbor classification approach (52). GA uses a direct analogy of natural behavior and works with a “population” of “chromosomes.” Each chromosome represents a possible solution to a given problem. A chromosome is assigned a fitness score according to how good a solution to the problem it is. Highly fit individuals are given opportunities to “reproduce,” by “cross breeding” with other individuals in the population. This produces new individuals (offspring), which share some features taken from each parent. The least fit members of the population are less likely to get selected for reproduction, and so die out. Selecting the best individuals from the current “generation” and mating them to produce a new set of individuals produce an entirely new population of possible solutions. This new generation contains a higher proportion of the characteristics possessed by the good members of the previous generation. In this way, over many generations, good characteristics are spread throughout the population, being mixed and exchanged with other good characteristics. The fitness of each chromosome is determined by its ability to classify the training set samples according to the KNN procedure. In KNN, each sample was classified according to its k nearest neighbors, using the Euclidean distance metric in d-dimensional space (d is the number of probesets in the expression profile for a given patient sample). In our initial experiments, we have chosen k=3. In consensus rule, if all of the k nearest neighbors of a sample belong to the same class, the sample is classified as that class; otherwise, the sample is considered unclassifiable. In majority rule, if more than half of the k nearest neighbors of a sample belong to the same class, the sample is classified as that class; otherwise, the sample is considered unclassifiable.
The GA/KNN methodology was implemented as a C/MPI parallel program on the LosLobos Linux supercluster. The program terminates when 2000 good solutions have been obtained. Following this initial processing, the frequency with which each probeset was selected was analyzed.
The parameters used were as follows:
See Table 61. Here pVal1 is p-value of testing whether the SR is larger than 0.5 and pVal2 is p-value of testing whether the OR is larger than 1. Both pVal1s and pVal2s are very small (<<0.05) for our predictions. So they are significant.
4) Classification Results with DIFF Genes
The numbers of DIFF calls are 46, 32, and 36 in top 50 discriminating genes, for A vs. BC, B vs. CA, and C vs. AB respectively. We did classification only based on DIFF genes, for A vs. BC, B vs. CA, and C vs. AB respectively. Unfortunately, no improvement of SRs was observed for test dataset (Table 62).
Homo sapiens mRNA; cDNA DKFZp586F1322
Homo sapiens cDNA FLJ30217 fis, clone BRACE2001709,
Homo sapiens clone 24775 mRNA sequence
Summary
Translocations involving the MLL (ALL-1, HRX, Htrx-1) gene at chromosome band 11q23 are the most common cytogenetic abnormality seen in infant leukemia. While there is evidence that MLL-associated chromosomal rearrangements carry a poorer prognosis, the pathogenesis and unique gene expression for each MLL rearrangement remain largely undefined. Using oligonucleotide arrays (Affymetrix U95Av2) and both unsupervised and supervised analysis methods we derived comprehensive gene expression profiles from a retrospective cohort of 126 infant cases registered to NCI-sponsored clinical trials. Fifty-three of those cases had MLL rearrangements with several partner genes (AF4, ENL, AF10, AF9 and AF1Q). We used class identification methods (Bayesian networks, Support Vector Machines and Discriminant Analysis) to determine genes with common patterns of expression across all the MLL cases as well as genes that were uniquely expressed and distinguishing of each MLL translocation variant. However, class discovery tools suggested that the MLL-associated profiles were quite heterogeneous among different translocation variants and were dominated by three differential expression patterns. Interpretation of our data indicated that infant MLL is an entity comprising several intrinsic biologic classes not precisely predicted by current standards of morphology, immunophenotyping, or cytogenetics. Consideration of such class-membership could improve classification schemes and reveal potential therapeutic targets for MLL-associated leukemias.
Introduction
In Example XIII, we analyzed the gene expression profiles in samples of 126 infant acute leukemia patients. Three inherent biologic subgroups were identified. These groups were not well defined by traditional cell types (AML vs. ALL) or cytogenetic (MLL vs. not) labels. Instead, they reflected different etiologic events with biological and clinical relevance. The distribution of the MLL infant cases between those “etiology-driven” clusters is the focus of this study.
Materials and Methods
For this study we analyzed 126 diagnostic bone marrow samples from patients with acute leukemia who were aged <1 year at diagnosis. In each case, the percentage of blast was >80%. The cohort was designed from cases registered to NCI-sponsored Infant Oncology Group/Children's Oncology Group treatment trials number 8398, 8493, 8821, 9107, 9407 and 9421. Of the 126 cases, 78 (62%) were acute lymphocytic leukemia (ALL) and 48 (38%) were acute myeloid leukemia (AML) by standard morphological and immunophenotypic criteria. Fifty-three (42%) cases had translocations involving the MLL gene (chromosome segment 11q23). An average of 2×107 cells were used for total RNA extraction with the Qiagen RNeasy mini kit (Valencia, Calif.). The yield and integrity of the purified total RNA were assessed using the RiboGreen assay (Molecular Probes, Eugene, Oreg.) and the RNA 6000 Nano Chip (Agilent Technologies, Palo Alto, Calif.), respectively. Complementary RNA (cRNA) target was prepared from 2.5 μg total RNA using two rounds of Reverse Transcription (RT) and In Vitro Transcription (IVT). Following denaturation for 5 min at 70° C., the total RNA was mixed with 100 pmol T7-(dT)24 oligonucleotide primer (Genset Oligos, La Jolla, Calif.) and allowed to anneal at 42° C. The mRNA was reverse transcribed with 200 units Superscript II (Invitrogen, Grand Island, N.Y.) for 1 hr at 42° C. After RT, 0.2 vol 5× second strand buffer, additional dNTP, 40 units DNA polymerase I, 10 units DNA ligase, 2 units RnaseH (Invitrogen) were added and second strand cDNA synthesis was performed for 2 hr at 16° C. After T4 DNA polymerase (10 units), the mix was incubated an additional 10 min at 16° C. An equal volume of phenol:chloroform:isoamyl alcohol (25:24:1) (Sigma, St. Louis, Mo.) was used for enzyme removal. The aqueous phase was transferred to a microconcentrator (Microcon 50, Millipore, Bedford, Mass.) and washed/concentrated with 0.5 ml DEPC water until the sample was concentrated to 10-20 ul. The cDNA was then transcribed with T7 RNA polymerase (Megascript, Ambion, Austin, Tex.) for 4 hr at 37° C. Following IVT, the sample was phenol:chloroform:isoamyl alcohol extracted, washed and concentrated to 10-20 ul. The first round product was used for a second round of amplification which utilized random hexamer and T7-(dT)24 oligonucleotide primers, Superscript II, two RNase H additions, DNA polymerase I plus T4 DNA polymerase finally and a biotin-labeling high yield T7 RNA polymerase kit (Enzo Diagnostics, Farmingdale, N.Y.). The biotin-labeled cRNA was purified on Qiagen RNeasy mini kit columns, eluted with 50 ul of 45° C. RNase-free water and quantified using the RiboGreen assay. Following quality check on Agilent Nano 900 Chips, 15 ug cRNA were fragmented following the Affymetrix protocol (Affymetrix, Santa Clara, Calif.). The fragmented RNA was then hybridized for 20 hours at 45° C. to HG_U95Av2 probes. The hybridized probe arrays were washed and stained with the EukGE_WS2 fluidics protocol (Affymetrix), including streptavidin phycoerythrin conjugate (SAPE, Molecular Probes, Eugene, Oreg.) and an antibody amplification step (Anti-streptavidin, biotinylated, Vector Labs, Burlingame, Calif.). HG_U95Av2 chips were scanned at 488 nm, as recommended by Affymetrix. The expression value of each gene was calculated using Affymetrix Microarray Suite 5.0 software.
Data Presentation and Exclusion Criteria
Criteria used as quality controls included: total RNA integrity, cRNA quality, array image inspection, B2 oligo performance, and internal control genes (GAPDH value greater than 1800). Of the initial cohort of 142 infant acute leukemia cases, 126 were finally part of this study.
Data Analysis
Affymetrix MAS 5.0 statistical analysis software was used to process the raw microarray image data for a given sample into quantitative signal values and associated present, absent or marginal calls for each probe set. A filter was then applied which excluded from further analysis all Affymetrix “control” genes (probe sets labelled with AFFX—prefix), as well as any probe set that did not have a “present” call at least in one of the samples. This filtering step reduced the number of probe sets from 12625 to 8414, resulting in a matrix of 8,414×126 signal values. Our Bayesian classification and VxInsight clustering analyses omitted this step; choosing instead to assume minimal a priori gene selection, as described in Helman et al., 2002 and Davidson et al., 2001. The first stage of our analysis consisted of a series of binary classification problems defined on the basis of clinical and biologic labels. The nominal class distinctions were ALL/AML, MLL/not-MLL, and achieved complete remission CR/not-CR. Additionally, several derived classification problems were considered based on restrictions of the full cohort to particular subsets of the data (such as the VxInsight clusters). The multivariate supervised learning techniques used included Bayesian nets (Helman et al., 2002) and support vector machines (Guyon et al., 2002). The performance of the derived classification algorithms was evaluated using fold-dependent leave-one-out cross validation (LOOCV) techniques. These methods allowed the identification of genes associated with remission or treatment failure and with the presence or absence of translocations of the MLL gene across the dataset.
In order to identify potential clusters and inherent biologic groups, a large number of clinical co-variables were correlated with the expression data using unsupervised clustering methods such as hierarchical clustering, principal component analysis and a force-directed clustering algorithm coupled with the VxInsight visualization tool. Agglomerative hierarchical clustering with average linkage (similar to Eisen et al., 1998) was performed with respect to both genes and samples, using the MATLAB (The Mathworks, Inc.), MatArray toolbox, as well as the native MATLAB statistics toolbox. The data for a given gene was first normalized by subtracting the mean expression value computed across all patients, and dividing by the standard deviation. The distance metric used for the hierarchical clustering was one minus Pearson's correlation coefficient. This metric was chosen to enable subsequent direct comparison with the VxInsight cluster analysis, which is based on the t-statistic transformation of the correlation coefficient (Davidson et al., 2001).
The second clustering method was a particle-based algorithm implemented within the VxInsight knowledge visualization tool. In this approach, a matrix of pair similarities is first computed for all combinations of patient samples. The pair similarities are given by the t-statistic transformation of the correlation coefficient determined from the normalized expression signatures of the samples (Davidson et al., 2001). The program then randomly assigns patient samples to locations (vertices) on a two dimensions graph, and draws lines (edges) linking each sample pair, assigning each edge a weight corresponding to the pairwise t-statistic of the correlation. The resulting two-dimensional graph constitutes a candidate clustering. To determine the optimal clustering, an iterative annealing procedure is followed. In this procedure a ‘potential energy’ function that depends on edge distances and weights is minimized by following random moves of the vertices (Davidson et al., 1998, 2001). Once the 2D graph has converged to a minimum energy configuration, the clustering defined by the graph is visualized as a 3D terrain map, where the vertical axis corresponds to the density of samples located in a given 2D region. The resulting clusters are robust with respect to random starting points and to the addition of noise to the similarity matrix, evaluated through effects on neighbour stability histograms (Davidson et al., 2001).
Results
Expression Profiling Demonstrates Heterogeneity Across Infant MLL Cases
The determine the variations in gene expression profiles of infant leukemia cases involving different MLL rearrangements, 126 infant leukemia cases registered to NCI-sponsored Infant Oncology Group/Children's Oncology Group treatment trials were studied using oligonucleotide microarrays containing 12,625 probe sets (Affymetrix U95Av2 array platform). Of the 126 cases, fifty-three (42%) cases had translocations involving the MLL gene (chromosome segment 11q23). The distribution of the MLL cytogenetic abnormalities across this data set is provided in Table 63.
The initial examination of the data was accomplished using the force directed clustering algorithm coupled with the visualization tool, (Davidson et al., 1998; 2001). When applied to the infant cohort, this particle-based clustering algorithm demonstrated the existence of three well-separated groups of patients that displayed similar patterns of gene expression (
An important finding of the present study is that two very distinct groups of gene expression profiles could be identified across cases with the same t(4;11) rearrangement (VxInsight clusters A and B). Using ANOVA, a gene list that characterizes the t(4;11) groups within the infant clusters A and B was derived (
Gene Expression Patterns of Different MLL Translocations
The second method used in our analysis was aimed at uncovering sets of genes that characterized each one of the MLL translocations. The process of defining the best set of discriminating genes was accomplished using supervised learning techniques such as Bayesian Networks, Linear Discriminant Analysis and Support Vector Machines (SVM) (Reviewed in Orr, 2002). In contrast with unsupervised methods, supervised learning methods learn “known classes”, creating classification algorithms that may undercover interesting and novel therapeutic targets. Our characterization of the gene expression profiles per MLL variant and the genes involved in these translocations accomplished using supervised learning techniques is shown in
Gene expression profiles characteristic of the t(4;11) and other MLL translocations are shown in
Expression Levels of FLT3 Across Various MLL Translocations
Expression levels of the FMS-related tyrosine kinase 3 (FLT3) gene were analyzed across different MLL translocations. FLT3, a member of the receptor tyrosine kinase (RTK) class III, is preferentially expressed on the surface of a high proportion of acute myeloid leukemia (AML) and B-lineage acute lymphocytic leukemia (ALL) cells in addition to hematopoietic stem cells, brain, placenta and liver (Kiyoe, 2002). Within MLL subgroups FLT3 is variable. The expression levels for this gene were differentially higher in t(4;11), t(11;19), t(9;11) and other MLL translocations (
Discussion
Gene expression profiling of our infant MLL leukemia cases revealed new insights into infant leukemia classification that may increase our understanding of the pathogenesis and hence, treatment options for this disease.
While groups of genes uniquely associated with each MLL translocation variant can be identified using supervised learning techniques (as previously shown by others), infant acute MLL leukemia seems to be an entity comprised of several intrinsic biologic clusters not precisely predicted by current standards of morphology, immunophenotyping, or cytogenetics. Unsupervised analysis demonstrated that gene expression in specific MLL rearrangements varied significantly amongst the three infant groups. As these intrinsic clusters appeared to relate to distinct subtypes of infant leukemia, the various MLL translocations may represent a critical secondary transforming event for each biological group, conferring more defined tumor phenotypes. Alternatively, MLL translocations may be permissive for further genetic rearrangements that will strongly influence and define differential gene expression patterns. Our findings of heterogeneity of gene expression within and between MLL subtypes differ from previous reports suggesting more homogeneous gene expression (Armstrong, 2002). This probably reflects mainly the larger number of cases available to us for analysis. However, rigorous exclusion of unsatisfactory samples was also critical for the successful interpretation of the data.
Particular genes that can be selected by supervised methods as characterizing cases with MLL translocations, in the current study the presence or absence of MLL rearrangements did not define a distinct leukemia class during unsupervised learning analysis of the gene expression patterns of these infant patients. Despite the fact that supervised analysis of the microarray data can successfully segregate patients defined by traditional methods such as immunophenotyping and cytogenetics, results from these techniques are most useful in the identification of unanticipated similarities and diversities in individual patients and thus may be useful in augmenting risk-group stratification in the future. Further studies to enhance the ability to classify infant MLL subtypes according to shared pathways of leukemic transformation will have important implications for the development of new therapeutic approaches.
This application claims the benefit of U.S. Provisional Application Ser. Nos. 60/432,064; 60/432,077; and 60/432,078; all of which were filed Dec. 6, 2002; and U.S. Provisional Application Ser. Nos. 60/510,904 and 60/510,968, both of which were filed Oct. 14, 2003; and a U.S. Provisional Application entitled “Outcome Prediction in Childhood Leukemia” filed on even date herewith. These provisional applications are incorporated herein by reference in their entireties.
This invention was made with government support under a grant from the National Institutes of Health (National Cancer Institute), Grant No. NIH NCI U01 CA88361; and under a contract from the Department of Energy, Contract No. DE-AC04-94AL85000. The U.S. Government has certain rights in this invention.
Number | Date | Country | |
---|---|---|---|
60432064 | Dec 2002 | US | |
60432077 | Dec 2002 | US | |
60432078 | Dec 2002 | US | |
60510904 | Oct 2003 | US | |
60510968 | Oct 2003 | US | |
60527610 | Dec 2003 | US |