The present invention is related to methods for detecting leukemia cells by determining the expression profile of a group of markers. In particular, the type or subtype of leukemia cells in a sample is determined. Further, uses of the group of markers are disclosed and compositions comprising these markers.
In the present specification, a number of documents are cited. The disclosure content of these documents, including manufacturers' manuals, is herewith incorporated by reference. This holds particularly true for documents such as the gene accession numbers cited in Tables 43a, b, 44 and 45, which provide the complete nucleotide sequence of marker genes/cDNAs. In other terms, by reciting these documents, applicant intends to incorporate the complete nucleotide/amino acid sequence of those markers where only a partial sequence has been identified in the appended Tables. It is also intended to include the (poly)peptide sequences translated from these nucleotide sequences within the disclosure content of the present specification.
Today leukemias are classified into four different groups or types: acute myeloid (AML), acute lymphatic (ALL), chronic myeloid (CML) and chronic lymphatic leukemia (CLL). Within these groups, several subcategories can be identified further using a panel of standard techniques as described below. The incidence of leukemias increases with age and is 5/100,000/year in AML, 1/100,000/year in ALL, 1/100,000 in CML and 6/100,000/year in CLL. Several methods for classification have to be applied at diagnosis and before treatment starts: cytomorphology and cytochemistry, multiparameter-immunophenotyping, cytogenetics including fluorescence in situ hybridization, and molecular techniques such as polymerase chain reaction (PCR). So far only a combination of these techniques allows a precise diagnosis, which is necessary to apply state of the art treatment. An exact diagnosis is mandatory: in CML, for example, the detection of a specific cytogenetic abnormality, the translocation (9;22), or its molecular counterpart, the BCR/ABL rearrangement, is required to establish the diagnosis. While all patients with CML show a BCR-ABL rearrangement and are therefore homogeneous with regard to the primary genetic abnormality, in AML and ALL at least 10-15 different subgroups have been identified on the morphological, genetic or molecular level. Also in CLL several subgroups can be clearly separated. These different subcategories in leukemias are associated with varying clinical outcome and therefore are the basis for different treatment strategies. The importance of highly specific classification may be illustrated in further detail for AML as a very heterogeneous group of diseases.
Data from clinical trials showed that the outcome of patients with AML differs over a broad range. Several parameters influencing prognosis have been identified. These can be assigned to different categories: patients' characteristics (e.g. age, comorbidity), therapy, and biology of the AML. Therefore, a lot of effort was invested to identify biological entities and to distinguish subgroups of AML which are associated with a favorable, intermediate or unfavorable prognosis, respectively. In order to allow a comparison between different studies, a classification of AML was mandatory. In 1976 the FAB classification was proposed by the French-American-British co-operative group. It was based on cytomorphology and cytochemistry in order to separate AML subgroups according to the morphological appearance of blasts in the blood and bone marrow. In addition, it was recognized that genetic abnormalities occurring in the leukemic blast had a major impact on the morphological picture and even more on the prognosis. So far, the karyotype of the leukemic blasts is the most important independent prognostic factor regarding response to therapy as well as survival. For clinical purposes karyotype analysis allows discrimination between three major prognostic groups. A favorable outcome under currently used treatment regimens, with cure rates from 50% up to 85%, was observed in several studies in patients with a) t(8;21)(q22;q22) occurring in AML M2, b) inv(16)(p13q22) occurring in AML M4eo and c) t(15;17)(q22;q11-12) occurring in AML M3/M3v. In contrast, chromosome aberrations with an unfavorable clinical course are −5/del(5q), −7/del(7q), inv(3)/t(3;3) and complex aberrant karyotypes, with cure rates of only 10%. The remainder of AML patients are assigned to a prognostically intermediate group. This latter group is very heterogeneous because it includes patients with a normal karyotype as well as those with rare chromosome aberrations with yet unknown prognostic impact.
The sub-classification of leukemias becomes increasingly important to guide therapy. Thus, the development of new, specific treatment approaches requires the identification of specific subtypes that may benefit from a distinct therapeutic protocol. It has already been shown in two entities that the development of specific drugs can improve the outcome of distinct subsets of leukemia. One important example is the development of a new therapeutic drug (STI571) for the treatment of chronic myeloid leukemia (CML): this designed molecule inhibits the CML-specific chimeric tyrosine kinase BCR-ABL generated from the genetic defect observed in CML, the BCR-ABL rearrangement due to the translocation between chromosomes 9 and 22 (t(9;22)(q34;q11)). First data show that therapy response is dramatically higher in patients treated with this new drug as compared to all other drugs that had been used so far. Another example is the subtype of acute myeloid leukemia AML M3 and its variant M3v, both with karyotype t(15;17)(q22;q11-12). The introduction of a new drug (all-trans retinoic acid, ATRA) has improved the outcome in this subgroup of patients from about 50% to 85% long-term survivors. As it is mandatory for patients suffering from these specific leukemia subtypes to be identified as fast as possible so that the best therapy can be applied, diagnostics today must accomplish sub-classification with maximal precision. Not only for these subtypes but also for several other leukemia subtypes different treatment approaches could improve outcome. Therefore, rapid and precise identification of distinct leukemia subtypes is the future goal for diagnostics.
So far a combination of methods is necessary to obtain the most important information in leukemia diagnostics: Analysis of the morphology and cytochemistry of bone marrow blasts and peripheral blood cells is necessary to establish the diagnosis. In some cases the addition of immunophenotyping is mandatory to separate very undifferentiated AML from acute lymphoblastic leukemia and CLL. The leukemia subtypes investigated can be diagnosed by cytomorphology alone only if an expert reviews the smears. However, a genetic analysis based on chromosome analysis, fluorescence in situ hybridization or RT-PCR and immunophenotyping is required in order to assign all cases to the right category. The aim of these techniques besides diagnosis is mainly to determine the prognosis of the leukemia. A major disadvantage of these methods, however, is that viable cells are necessary, as the cells for genetic analysis have to divide in vitro in order to obtain metaphases for the analysis. Another problem is the long time of 72 hours from receipt of the material in the laboratory to obtaining the result. Furthermore, great experience in preparation of chromosomes and even more in analyzing the karyotypes is required to obtain the correct result in at least 90% of cases. Such experts in their field are necessary for all other techniques mentioned above as well. Accordingly, standard diagnosis of leukemia uses a combination of complementary methods, is expensive and time-consuming, and requires experienced experts in the field. Methods that have to be combined are cytomorphology or histomorphology, multiparameter-immunophenotyping, cytogenetics, fluorescence in situ hybridization, and molecular genetics such as polymerase chain reaction based assays.
Using these techniques in combination, hematological malignancies in a first approach are separated into chronic myeloid leukemia (CML), chronic lymphoid (CLL), acute lymphoblastic (ALL), and acute myeloid leukemia (AML). Within the latter three disease entities several prognostically relevant subtypes have been established. As a second approach this further subclassification is based mainly on genetic abnormalities of the leukemic blasts and clearly is associated with different prognoses. Therefore, this subclassification is increasingly important to guide therapy. Furthermore, the development of new, specific treatment approaches requires precise identification of leukemia subtypes.
In a first study Golub et al. (Science 1999) showed that gene expression profiles can be used for class prediction and discriminated AML from ALL samples. However, for his analysis of acute leukemias the selection of the two different subgroups was performed using exclusively morphologic-phenotypical criteria. This was only descriptive and does not provide deeper insights into the pathogenesis or the underlying biology of the leukemia. The approach reproduces only very basic knowledge of cytomorphology and intends to differentiate classes. The data is not sufficient to predict prognostically relevant cytogenetic aberrations.
Thus, the technical problem underlying the present invention was to provide means for leukemia diagnostics which overcome the disadvantages of the prior art diagnostic methods.
The solution to said technical problem is achieved by providing the embodiments characterized in the claims. Accordingly, the present invention relates to a method of determining whether a patient sample contains leukemia cells or other cells comprising the steps of a) determining the expression profile of a group of markers in a patient sample and b) concluding from the expression profile whether the patient sample contains leukemia cells or other cells characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 3 to 6, tables 15 to 20, tables 29, 30, 41, or 42 and whereby the number of markers in the group is between one and the total number of markers listed in the tables 3 to 6, tables 15 to 20, and tables 29, 30, 41, or 42. In a particular embodiment thereof, the present invention pertains to a method wherein leukemia type and subtype are simultaneously determined whereby a microarray for the detection of the expression level of a marker or a group of markers is used.
It is important to note that in accordance with the invention in all pertaining embodiments any possible combination of markers, said markers being disclosed in the respective table or tables is encompassed within the scope of the invention.
As used herein, the term “expression” refers to the process by which mRNA or a polypeptide is produced based on the nucleic acid sequence of a gene. The process includes both transcription and translation, i.e. “expression” shall also include the formation of mRNA upon transcription.
In accordance with the present invention, the term “determining the expression profile” preferably refers to the determination of the level of expression, namely of said group of markers.
As used herein, the term “marker” refers to a DNA, in particular cDNA, or RNA or a fragment thereof or a protein or a fragment thereof which are in the case of RNA (or cDNA) formed upon transcription of a nucleotide sequence which is capable of expression. The nucleic acid molecule fragments refer to fragments preferably of at least 8 such as ten, twelve, fifteen or eighteen nucleotides in length representing a consecutive stretch of nucleotides of a gene, cDNA or mRNA such as of 20 or nucleotides that are, for example, further specified in the appended Tables or a complementary sequence thereto. In other terms, markers include any fragment (or complementary sequence thereto) of the sequences depicted in the appended tables as long as these fragments unambiguously identify the marker. Typical fragment lengths are provided above. The determination of the expression profile of markers may be effected at the transcriptional or translational level. In other terms, the method of the invention envisages the determination at the level of mRNA or at the protein level. Protein fragments such as peptides advantageously comprise at least 6 consecutive amino acids representative of the corresponding full length protein. 6 amino acids are generally recognized as the lowest peptidic stretch giving rise to a linear epitope recognized by an antibody, fragment or derivative thereof. Alternatively, the proteins or fragments thereof may be analysed using nucleic acid molecules specifically binding to three-dimensional structures (aptamers). In principle, the investigator may determine, in accordance with the method of the invention, whether a gene is expressed at all in a leukemic or other cell. Alternatively, an investigator may determine the difference in the expression level, for example, between a leukemic and a non-leukemic cell or between two or more different types or subtypes of leukemia. If the sample comprises only other, i.e. non-leukemia cells, then the patient's suffering from a leukaemia may safely be denied. Insofar, the above main embodiment is to be understood that if the presence of other cells is determined then this determination includes an assessment to the effect that only other cells but no leukemic cells are comprised in the sample. On the other hand, the determination of leukemic cells may include the further characterization of such cells including the differentiation status of the cells as well as the distinction from other types of cancer cells or other subtypes of leukaemia cells. Particular embodiments in this regard are further outlined herein below.
In accordance with the above, the present invention also contemplates methods where simply the assessment of leukaemia cells but not necessarily of other cells is effected. This holds true for all embodiments where the determination of other cells is mentioned. It is to be understood that with the exception of the possible determination of other cells, the steps of the various methods of the invention remain unchanged. Thus, the invention also relates to a method of determining whether a patient sample contains leukemia cells comprising the steps of a) determining the expression profile of a group of markers in a patient sample and b) concluding from expression profile whether the patient sample contains leukemia cells characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 3 to 6, tables 15 to 20, tables 29, 30, 41, or 42 and whereby the number of markers in the group is between one and the total number of markers listed in the tables 3 to 6, tables 15 to 20, and tables 29, 30, 41, or 42. 
Thus, the invention further relates to a method of determining whether a patient sample contains leukemia cells and at the same time or subsequently determining the type and subtype of leukemia cells, if leukemia cells are present, comprising the steps of a) determining the expression profile of a group of markers in a patient sample and b) concluding from the expression profile whether the patient sample contains leukemia cells and at the same time or subsequently determining the type and subtype of leukemia cells, if leukemia cells are present, characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 16 to 20 or table 29 or 30 and whereby the number of markers in the group is between one and the total number of markers listed in the tables 16 to 20 or table 29 or 30, to name two important embodiments of the invention.
Determination of the expression profile/levels may be effected by a variety of methods, depending on the nature of the marker. Thus, if the marker is mRNA, cDNA may be prepared into which a detectable label, such as a fluorescent, chemiluminescent, bioluminescent or radioactive (such as 3H or 32P) label, is incorporated. Said detectably labelled cDNA, in single-stranded form, may then be hybridised, preferably under stringent or highly stringent conditions, to a panel of single-stranded oligonucleotides representing different genes and affixed to a solid support such as a chip. Upon applying appropriate washing steps, those cDNAs will be detected or quantitatively detected that have a counterpart in the oligonucleotide panel. Various advantageous embodiments of this general method are feasible. For example, the mRNA or the cDNA may be amplified wherein it is, for quantitative assessments, preferable that the number of amplified copies corresponds, relative to further amplified mRNAs or cDNAs, to the number of mRNAs originally present in the cell. Also, the cDNAs may be transcribed into cRNAs wherein only in the transcription step a label is incorporated into the nucleic acid and wherein the cRNA is employed for hybridisation. Alternatively, the label may be attached subsequent to the transcription step. Similarly, proteins from a cell or tissue under investigation may be contacted with a panel of aptamers or of antibodies or fragments or derivatives thereof. The antibodies etc. may be affixed to a solid support such as a chip. Binding of proteins indicative of a leukemia or a subtype of leukaemia may be verified by binding to a detectably labelled secondary antibody or aptamer. For the labelling of antibodies, it is referred to Harlow and Lane, “Antibodies, a laboratory manual”, CSH Press, 1988, Cold Spring Harbor.
As regards further test assays and formats, it is referred to further embodiments of the invention as specified herein below as well as to the appended examples. In addition, a number of applicable assay formats are available in the art that can be applied to the method of the invention without further ado. Specifically, a minimum set of proteins necessary for diagnosis of all leukemia types may be selected for creation of a protein array system to make a diagnosis directly on a protein lysate of a diagnostic bone marrow sample. Protein array systems for the detection of specific protein expression profiles are already available (for example: Bio-Plex, BIORAD, München, Germany). For this application, antibodies against the proteins preferably have to be produced and immobilized on a platform, e.g. glass slides or microtiter plates. The immobilized antibodies can be labeled with a reactant specific for the certain target proteins as discussed above. The reactants can include enzyme substrates, DNA, receptors, antigens or antibodies to create, for example, a capture sandwich immunoassay.
The level of the expression of the “marker” is indicative of a leukemic condition, of a cell or an organism. The level of expression of a marker or group of markers is measured and is compared with the level of expression of the same marker or the same group of markers from other cells or samples. The comparison may be effected in an actual experiment or in silico. When the expression level also referred to as expression pattern or expression signature (expression profile) is measurably different, there is according to the invention a meaningful difference in the level of expression. Preferably the difference at least is 5%, 10% or 20%, more preferred at least 50% or may even be as high as 75% or 100%. More preferred the difference in the level of expression is at least 200%, i.e. two fold, at least 500%, i.e. five fold, or at least 1000%, i.e. 10 fold.
The present invention allows the diagnosis of a wide variety of at least 14 different clinically relevant leukemia subtypes. Therefore, with the invention of a combination of marker genes and their specific expression levels, it is possible to substitute all other mandatory diagnostic approaches (cytomorphology or histomorphology, multiparameter-immunophenotyping, cytogenetics, fluorescence in situ hybridization, and molecular genetics), including the approach of Golub and colleagues, in one single step with a specificity and sensitivity that had never been achieved with any of the other techniques used so far.
In more detail, based on biomathematical analysis of gene expression profiles a new method could be provided which forms the basis for designing and developing a novel diagnostic approach preferably based on microarray technology. Further, subsets of markers, preferably genes, could be introduced which allow the determination of leukemia type and subtype. The method according to the invention abolishes today's standard procedures in diagnosis of leukemia. These standard diagnostic procedures require more and more centralized core facilities with both expert personnel in the fields of cytomorphology, cytogenetics and molecular genetics and expensive lab equipment, which causes increasing costs for adequate diagnosis. The present invention provides novel cost-effective methods and diagnostic tools which are less time consuming and easy to operate but nevertheless as accurate and safe as all standard methods combined today. The genes or sets of genes allow clinical samples to be assigned either as healthy or malignant simply based on their gene expression profiles. The genes, representative fragments thereof or transcription or translation products thereof form the basis for the methods of the invention or diagnostic tools corresponding thereto. Furthermore, these genes etc. allow the prediction of diagnoses based on the genetic abnormality of the expression pattern and the discrimination between different prognostically relevant entities.
When comparing two groups of microarray experiments, Golub's method (Science 286 (1999), 531-537) sorts the genes with respect to the signal-to-noise ratio of gene x: Sx = (μ1 − μ2)/(σ1 + σ2), where μk and σk denote the mean expression and standard deviation of gene x in group k.
According to a specified number of “informative” genes the 20 best discriminating genes are selected. For each informative gene a decision limit is calculated as bx = (μ1 + μ2)/2. To classify a new sample of an independent test set, the gene expression levels of informative genes are taken and for each gene x and sample y a so-called vote is calculated as Vx = Sx(gxy − bx), where gxy denotes the expression level of gene x in sample y. The votes of all informative genes are summed up (“weighted voting”) and depending upon the sign of this sum the new sample is classified as group 1 or group 2. The confidence in the prediction is calculated as |ΣVx| / Σ|Vx|.
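The weighted voting scheme described above can be sketched as follows. This is only a minimal illustration in plain Python, not the implementation used in the examples: the function names, the dictionary-based data layout, the toy expression values and the use of the population standard deviation are all assumptions made here for illustration.

```python
from statistics import mean, pstdev

def signal_to_noise(group1, group2):
    """Sx = (mu1 - mu2) / (sigma1 + sigma2) for one gene.
    Assumption: population standard deviation; groups are not all-constant."""
    return (mean(group1) - mean(group2)) / (pstdev(group1) + pstdev(group2))

def decision_limit(group1, group2):
    """bx = (mu1 + mu2) / 2 for one gene."""
    return (mean(group1) + mean(group2)) / 2

def weighted_vote(train1, train2, sample):
    """train1 / train2: dict gene -> expression levels in group 1 / group 2.
    sample: dict gene -> expression level of the new sample.
    Returns (predicted group, confidence)."""
    votes = []
    for gene in train1:
        s_x = signal_to_noise(train1[gene], train2[gene])
        b_x = decision_limit(train1[gene], train2[gene])
        votes.append(s_x * (sample[gene] - b_x))  # Vx = Sx (gxy - bx)
    total = sum(votes)
    norm = sum(abs(v) for v in votes)
    confidence = abs(total) / norm if norm else 0.0
    # A positive sum points to group 1, otherwise group 2.
    return (1 if total > 0 else 2), confidence

# Hypothetical toy data: two informative genes, three training samples per group.
train1 = {"geneA": [10, 12, 14], "geneB": [2, 3, 4]}
train2 = {"geneA": [4, 5, 6], "geneB": [8, 9, 10]}
sample = {"geneA": 11, "geneB": 3}

group, confidence = weighted_vote(train1, train2, sample)
print(group, confidence)
```

Both genes vote in the same direction in this toy case, so the confidence reaches its maximum; genes voting against each other would lower it.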
To assess the significance of each gene, a permutation test is performed, which determines signal-to-noise ratios when class labels are permuted randomly.
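A permutation test of this kind might look like the sketch below. It is an illustration under stated assumptions rather than the procedure of the examples: the number of permutations, the fixed random seed, the p-value estimate with a +1 correction and all function names are choices made here.

```python
import math
import random
from statistics import mean, pstdev

def signal_to_noise(group1, group2):
    """Sx = (mu1 - mu2) / (sigma1 + sigma2), guarded against zero noise."""
    diff = mean(group1) - mean(group2)
    noise = pstdev(group1) + pstdev(group2)
    if noise == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / noise

def permutation_p_value(group1, group2, n_permutations=1000, seed=0):
    """Fraction of random relabelings whose |Sx| is at least the observed |Sx|."""
    rng = random.Random(seed)          # assumption: fixed seed for reproducibility
    observed = abs(signal_to_noise(group1, group2))
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)            # permute class labels randomly
        if abs(signal_to_noise(pooled[:n1], pooled[n1:])) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Hypothetical expression levels of one gene in two groups.
p_separated = permutation_p_value([10, 11, 12, 13, 14], [1, 2, 3, 4, 5])
p_same = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
print(p_separated, p_same)
```

A small value for a gene indicates that its observed signal-to-noise ratio is unlikely to arise from random class labels.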
To assess the robustness of the classifier, a leave-one-out crossvalidation is performed. Accuracy is the rate of correctly classified test samples.
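Leave-one-out crossvalidation of a weighted voting classifier can be sketched as follows. The data layout (a list of per-sample dictionaries with labels 1 and 2), the helper names and the toy values are assumptions for illustration only; each training group is assumed to retain at least two distinct values after a sample is held out.

```python
from statistics import mean, pstdev

def train(samples, labels, genes):
    """Return per-gene (Sx, bx) computed on the training samples."""
    model = {}
    for g in genes:
        g1 = [s[g] for s, l in zip(samples, labels) if l == 1]
        g2 = [s[g] for s, l in zip(samples, labels) if l == 2]
        s_x = (mean(g1) - mean(g2)) / (pstdev(g1) + pstdev(g2))
        b_x = (mean(g1) + mean(g2)) / 2
        model[g] = (s_x, b_x)
    return model

def predict(model, sample):
    """Weighted voting: the sign of the summed votes decides the group."""
    total = sum(s_x * (sample[g] - b_x) for g, (s_x, b_x) in model.items())
    return 1 if total > 0 else 2

def leave_one_out_accuracy(samples, labels, genes):
    """Hold out each sample once, train on the rest, and count correct calls."""
    correct = 0
    for i in range(len(samples)):
        rest_samples = samples[:i] + samples[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        model = train(rest_samples, rest_labels, genes)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Hypothetical toy data: one gene, four samples per group.
samples = [{"geneA": v} for v in (10, 11, 12, 13, 1, 2, 3, 4)]
labels = [1, 1, 1, 1, 2, 2, 2, 2]
accuracy = leave_one_out_accuracy(samples, labels, ["geneA"])
print(accuracy)
```

Because the held-out sample never contributes to Sx or bx, the resulting accuracy estimates how the classifier behaves on unseen samples.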
The decision limit proposed by Golub does not provide optimal classification accuracy in all situations. When the standard deviations of expression levels within the two groups are very different, the decision limit is biased towards the group with the higher standard deviation.
A decision limit for a particular gene can be considered optimal, if it achieves maximum classification accuracy for a given dataset. By determining systematically classification accuracies for a set of possible decision limits, an optimal decision limit can be calculated. The underlying statistics as described in Example 3 select an optimal decision limit from the following set of decision limits Lx:
Lx = {(gxy + gx(y−1))/2 | 1 < y ≤ n}
where gxy denotes the expression level of gene x in sample y and n denotes the total number of samples in the training set.
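The search for an optimal decision limit over the candidate set Lx can be sketched as below. Two assumptions are made here that the text does not state explicitly: the samples are ordered by expression level before neighbouring values are averaged, and the group with the higher mean is the one called above the limit. Function names and the toy values are hypothetical.

```python
from statistics import mean

def candidate_limits(levels):
    """Midpoints between neighbouring expression levels (assumed sorted order)."""
    ordered = sorted(levels)
    return [(ordered[y] + ordered[y - 1]) / 2 for y in range(1, len(ordered))]

def accuracy_at(limit, levels, labels, high_group):
    """Fraction of samples classified correctly when `high_group` lies above `limit`."""
    low_group = 2 if high_group == 1 else 1
    correct = sum(
        1 for v, l in zip(levels, labels)
        if (high_group if v > limit else low_group) == l
    )
    return correct / len(levels)

def optimal_limit(levels, labels):
    """Pick the candidate limit with maximum training accuracy (first one on ties)."""
    g1 = [v for v, l in zip(levels, labels) if l == 1]
    g2 = [v for v, l in zip(levels, labels) if l == 2]
    high_group = 1 if mean(g1) >= mean(g2) else 2
    best = max(candidate_limits(levels),
               key=lambda b: accuracy_at(b, levels, labels, high_group))
    return best, accuracy_at(best, levels, labels, high_group)

# Hypothetical gene: group 1 is tightly clustered, group 2 is widely spread.
levels = [10, 10.5, 11, 0, 4, 9]
labels = [1, 1, 1, 2, 2, 2]

midpoint = (mean(levels[:3]) + mean(levels[3:])) / 2   # limit from the group means
midpoint_accuracy = accuracy_at(midpoint, levels, labels, high_group=1)
best_limit, best_accuracy = optimal_limit(levels, labels)
print(midpoint, midpoint_accuracy, best_limit, best_accuracy)
```

In this toy case the limit computed from the group means falls inside the widely spread group and misclassifies one sample, whereas the limit chosen from the candidate set separates the training samples completely.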
Golub's method selects an arbitrary number of “informative” genes to discriminate between two classes of samples according to their signal-to-noise ratio, typically in the range of 10 to 50 genes.
Choosing too many genes, as in Golub's method, carries the risk of overfitting, which causes the model to generalize poorly.
Therefore the present invention applies a heuristic approach to select a minimal set of discriminative genes which provides maximum classification accuracy in leave-one-out crossvalidation. That is, for a given set of genes, weighted voting as described by Golub is applied and the classification accuracy is calculated by crossvalidation.
The method for achieving this used in accordance with the present invention and representing a further embodiment in accordance with this invention consists of the following steps:
If the gene improves accuracy and confidence, it is added to the weighted voting model, otherwise it is discarded.
Preferably, the decision limit is set according to the formula recited above.
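One way such a heuristic gene selection could be organised is sketched below. The individual steps are not fully spelled out in the text, so the following are assumptions made purely for illustration: candidate genes are tried in order of decreasing |Sx|, a gene is kept when the pair (crossvalidation accuracy, mean confidence) strictly improves, and the decision limit from the group means is used for brevity. All names and values are hypothetical.

```python
from statistics import mean, pstdev

def gene_stats(samples, labels, gene):
    """Return (Sx, bx) of one gene on the given training samples."""
    g1 = [s[gene] for s, l in zip(samples, labels) if l == 1]
    g2 = [s[gene] for s, l in zip(samples, labels) if l == 2]
    s_x = (mean(g1) - mean(g2)) / (pstdev(g1) + pstdev(g2))
    return s_x, (mean(g1) + mean(g2)) / 2

def classify(samples, labels, genes, sample):
    """Weighted voting over `genes`; returns (group, confidence)."""
    votes = []
    for g in genes:
        s_x, b_x = gene_stats(samples, labels, g)
        votes.append(s_x * (sample[g] - b_x))
    total, norm = sum(votes), sum(abs(v) for v in votes)
    return (1 if total > 0 else 2), (abs(total) / norm if norm else 0.0)

def loo_score(samples, labels, genes):
    """Leave-one-out (accuracy, mean confidence) for a set of genes."""
    correct, confidences = 0, []
    for i, held_out in enumerate(samples):
        rest_s = samples[:i] + samples[i + 1:]
        rest_l = labels[:i] + labels[i + 1:]
        group, conf = classify(rest_s, rest_l, genes, held_out)
        correct += group == labels[i]
        confidences.append(conf)
    return correct / len(samples), mean(confidences)

def select_genes(samples, labels, candidates):
    """Greedy selection: keep a gene only if it improves the crossvalidated score."""
    ranked = sorted(candidates,
                    key=lambda g: abs(gene_stats(samples, labels, g)[0]),
                    reverse=True)
    selected, best = [], (0.0, 0.0)
    for g in ranked:
        score = loo_score(samples, labels, selected + [g])
        if score > best:        # assumption: lexicographic (accuracy, confidence)
            selected.append(g)
            best = score
        # otherwise the gene is discarded
    return selected, best

# Hypothetical toy data: one discriminating gene and one uninformative gene.
good = (10, 11, 12, 13, 1, 2, 3, 4)
noise = (5, 1, 4, 2, 3, 5, 1, 4)
samples = [{"good": a, "noise": b} for a, b in zip(good, noise)]
labels = [1, 1, 1, 1, 2, 2, 2, 2]
selected, best = select_genes(samples, labels, ["noise", "good"])
print(selected, best)
```

In the toy data the discriminating gene alone already reaches the maximum score, so the uninformative gene cannot improve it and is discarded, which keeps the gene set minimal.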
In a pilot study consisting of 103 Affymetrix Genechip microarrays with 12625 genes each as shown in the appended examples we compared the results achieved with Golub's method and with our extended method.
Table A presents an analysis of 18 samples of class A versus 85 samples of class non-A. Based on 20 informative genes, Golub's method results in a crossvalidation accuracy of 0.87 (confidence 0.77); the extended method achieves, with three genes out of the top 20 set, a crossvalidation accuracy of 0.96 (confidence 0.88).
The same analysis was performed for one versus all (OVA) and all pairs (AP) comparisons in this dataset consisting of 5 different classes.
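For a dataset with five classes, the one versus all and all pairs comparisons can be enumerated as in the short sketch below. The class names are placeholders, not the classes of the pilot study, and the function names are hypothetical.

```python
from itertools import combinations

def one_versus_all(classes):
    """One task per class: that class against all remaining classes."""
    return [(c, [o for o in classes if o != c]) for c in classes]

def all_pairs(classes):
    """One task per unordered pair of classes."""
    return list(combinations(classes, 2))

classes = ["A", "B", "C", "D", "E"]   # placeholder labels for five classes
ova_tasks = one_versus_all(classes)
ap_tasks = all_pairs(classes)
print(len(ova_tasks), len(ap_tasks))  # 5 10
```

Each of these tasks is a two-group comparison, so the two-class weighted voting analysis can be applied to every task in turn.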
The development of a leukemia diagnostic tool, preferably microarray based, allows for all patients which are preferably humans and specimens a reproducible, highly specific and rapid method to obtain important information for treatment strategies in leukemia. This technique can be established in every laboratory using basic methods of molecular biology, and preferably makes use of hybridization and amplification such as PCR or LCR based techniques and does not require hematologists or cytogeneticists with several years of experience in leukemia diagnostics. Material for the analysis can be sent over large distances as it is not necessary that cells arrive viable in the laboratory. Therefore, a centralization of leukemia diagnostics with very high quality is possible.
Moreover, the accumulation of an immense knowledge about gene expression profiles in leukemia types and subtypes, which are not characterized by specific genetic abnormalities, leads to a more precise classification compared to all other methods used so far. In addition, the data compiled in accordance with the invention are helpful for the understanding of the pathogenesis of leukemia and will allow to identify genes which are specifically dysregulated. They may be considered as potential targets for therapeutic interventions specifically designed for the different leukemia subtypes.
Preferably the method according to the invention is characterized in that the group of markers consists of between two, such as three, four, five, six, seven, eight, nine or ten and the total number of markers listed in one or more of the tables 3 to 6, tables 15 to 20, and tables 29, 30, 41, or 42. Most preferred, the group consists of all markers listed in one or more tables, whereby the tables are selected from the tables 3 to 6, tables 15 to 20, and tables 29, 30, 41, or 42. The invention also contemplates that all markers in all tables are analysed. This holds true for the presently discussed as well as for embodiments discussed further below.
Another embodiment of the invention relates to a method of determining whether a patient sample contains leukemia cells or other cells and at the same time or subsequently determining the type and subtype of leukemia cells, if leukemia cells are present, comprising the steps of determining the expression profile, preferably the level of expression of a group of markers in a patient sample and concluding from the (altered) expression profile i.e. the difference in the level of expression, whether the patient sample contains leukemia cells or other cells and at the same time determining the type and subtype of leukemia cells, if leukemia cells are present, characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 16 to 20 or table 29 or 30 and whereby the number of markers in the group is between one, preferably two such as three, four, five, six, seven, eight, nine or ten and the total number of markers listed in one or more of the tables 16 to 20 or table 29 or 30. It is preferred that the group of markers consists of all markers listed in one or more tables, whereby the tables are selected from the tables 16 to 20 or table 29 or 30. In a preferred embodiment it is differentiated between four types of leukemia cells and the other cells in the patient sample. The other cells are preferably normal cells.
The “other cells” may be, for example, cells affected by a disease which is not a leukaemia. It is preferred, in accordance with the present invention that said other cells are normal cells, i.e. cells not affected by any disease.
This embodiment of the present invention allows for the differentiation between four different types of leukemias, i.e. AML, CLL, CML and ALL. As has been surprisingly demonstrated in accordance with the present invention, the qualitative and/or quantitative determination of an expression profile of a number of genes allows the unambiguous classification with any of the above, currently established types of leukemias. In principle and more preferred, the relation of the gene profile to the leukaemia type may take place at the same time at which determination of the leukaemia cells in the sample takes place. Alternatively, the classification may be effected at a later time point. It was surprising that the distinction between the large number of leukemia types and subtypes, including cytogenetically and immunophenotypically defined types, as well as types characterized by complex chromosomal aberrations, could be accomplished, preferably by the use of a microarray for the detection of the expression level of a marker or a group of markers, with such ease and accuracy. In particular, certain preferred subsets of genes are provided which can be used either to determine the leukemia type and subtype, or only to determine the subtypes of a certain leukemia type, or to differentiate certain types or subtypes, respectively, from one another.
In another embodiment a method is disclosed which allows differentiating between two types of leukemia cells or one type of leukemia cells and normal cells or non-leukemia cells in a patient sample comprising the steps of determining the expression profile preferably the level of expression, of a group of markers in the patient sample and concluding from the (altered) expression profile, i.e. the difference in the level of expression, which type of leukemia cells the patient sample contains or whether it contains (only) normal cells or non-leukemia cells characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 3 to 6 or tables 7 to 12 and whereby the number of markers in the group is between one, preferably two such as three, four, five, six, seven, eight, nine or ten and the total number of markers listed in one or more of the tables 3 to 6 or tables 7 to 12. In a preferred embodiment the group of markers consists of all markers listed in one or more of the tables 3 to 6 or tables 7 to 12.
In another embodiment of the invention a method is disclosed allowing the differentiation between the subtypes of AML cells or between the subtypes of AML cells and normal cells in a patient sample comprising the steps of determining the expression profile, preferably the level of expression of a group of markers in the patient sample and concluding from the (altered) expression profile, i.e. the difference in the level of expression, which subtypes of AML cells the patient sample contains or whether it contains normal cells characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 1, 2, 13, 14, 17, 25, 27, 35 and 36 and whereby the number of markers in the group is between one, preferably two such as three, four, five, six, seven, eight, nine or ten and the total number of markers listed in one or more of the tables 1, 2, 13, 14, 17, 25, 27, 35 and 36. In a preferred embodiment the group of markers consists of all markers listed in one or more of the tables 1, 2, 13, 14, 17, 25, 27, 35 and 36. It is preferred that three, four or more subtypes of AML cells are determined.
In another embodiment of the invention a method is disclosed allowing the differentiation between and thus the determination of the subtypes of ALL cells in a patient sample comprising the steps of (a) determining the level of expression of a group of markers in the patient sample and (b) concluding from the differences in the level of expression which subtypes of ALL cells the patient sample contains whereby the group of markers consists of markers selected independently from the markers listed in one or more of the tables 18, 32 or 33 and whereby the number of markers in the group is between one, preferably two such as three, four, five, six, seven, eight, nine or ten and the total number of markers listed in one or more of the tables 18, 32 or 33. It is preferred that the group of markers consists of all markers listed in one or more of the tables 18, 32 or 33.
In another embodiment of the invention a method is disclosed allowing the differentiation between and thus the determination of the subtypes of CLL cells in a patient sample comprising the steps of determining the level of expression of a group of markers in the patient sample and concluding from the differences in the level of expression which subtypes of CLL cells the patient sample contains whereby the group of markers consists of markers selected independently from the markers listed in one or more of the tables 38 or 39 and whereby the number of markers in the group is between one, preferably two such as three, four, five, six, seven, eight, nine or ten and the total number of markers listed in one or more of the tables 38 or 39. It is preferred that the group of markers consists of all markers listed in one or more of the tables 38 or 39.
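The differentiation step common to the embodiments above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each known subtype is summarized by one reference expression level per marker and that the patient sample is assigned to the reference at the smallest Euclidean distance. The specification does not prescribe this particular classifier, and the marker names, subtype names and expression values below are hypothetical.

```python
import math

def euclidean_distance(profile_a, profile_b, markers):
    """Distance between two expression profiles over a fixed group of markers."""
    return math.sqrt(sum((profile_a[m] - profile_b[m]) ** 2 for m in markers))

def assign_subtype(sample, references, markers):
    """Return the name of the reference profile closest to the sample.

    Nearest-reference assignment is an illustrative assumption, not the
    classifier prescribed by the specification.
    """
    if not markers:
        raise ValueError("the group of markers must contain at least one marker")
    return min(
        references,
        key=lambda name: euclidean_distance(sample, references[name], markers),
    )

# Hypothetical marker identifiers and expression levels (arbitrary units).
markers = ["marker_1", "marker_2", "marker_3"]
references = {
    "subtype_A": {"marker_1": 9.0, "marker_2": 2.0, "marker_3": 5.0},
    "subtype_B": {"marker_1": 3.0, "marker_2": 8.0, "marker_3": 5.5},
    "normal": {"marker_1": 5.0, "marker_2": 5.0, "marker_3": 5.0},
}
sample = {"marker_1": 8.4, "marker_2": 2.5, "marker_3": 4.8}
predicted = assign_subtype(sample, references, markers)
```

Including a "normal" reference alongside the subtype references mirrors the embodiments in which the sample may contain only normal or non-leukemia cells.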
In another embodiment of the invention, a method is disclosed of assessing the efficacy of a test compound for inhibiting leukemia, the method comprising comparing the expression profile of a group of markers in a first sample obtained from the patient and maintained in the presence of the test compound and the expression profile of a group of markers in a second sample obtained from the patient and maintained in the absence of the test compound, wherein a significantly altered expression profile of the group of markers in the first sample, relative to the second sample, is an indication that the test compound is efficacious for inhibiting leukemia in the patient characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 and whereby the number of markers in the group is between one, preferably two such as 3, 4, 5, 6, 7, 8, 9 or 10 and the total number of markers listed in the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42.
In accordance with this embodiment of the present invention, it is again preferred that in the comparison of expression profiles expression levels and differences in expression levels are determined and compared. It is further preferred that the alteration determined in accordance with the method of the invention in the expression profile or expression level is in the direction of the expression profile of normal cells or at least of diseased but non-leukemic cells. More preferably the alteration is in the direction of normal blood cells, most preferably blood cells of a single type. Accordingly, it is also preferred that the comparison includes an internal standard of expression levels of analysed markers wherein the internal standard represents the expression profile of non-leukemic and preferably normal cells. The comparison may be effected by relying on actual experimental data or on in silico obtained reference data.
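The requirement that the alteration be "in the direction of" the expression profile of normal cells can be sketched as follows. The sketch assumes, as an illustration only, that direction is measured as the reduction in Euclidean distance to an internal standard representing non-leukemic cells; the marker names and values are hypothetical, and the sketch does not test whether the alteration is statistically significant.

```python
import math

def distance(profile, reference, markers):
    """Euclidean distance between a profile and a reference over the marker group."""
    return math.sqrt(sum((profile[m] - reference[m]) ** 2 for m in markers))

def shift_toward_reference(treated, untreated, reference, markers):
    """Positive result: the treated sample is closer to the reference than the untreated one."""
    return distance(untreated, reference, markers) - distance(treated, reference, markers)

# Hypothetical values (arbitrary units); the reference plays the role of the
# internal standard for non-leukemic, preferably normal, cells.
markers = ["marker_1", "marker_2"]
normal_reference = {"marker_1": 5.0, "marker_2": 5.0}
untreated = {"marker_1": 9.0, "marker_2": 2.0}  # maintained without test compound
treated = {"marker_1": 6.0, "marker_2": 4.0}    # maintained with test compound

shift = shift_toward_reference(treated, untreated, normal_reference, markers)
moves_toward_normal = shift > 0
```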
In another embodiment of the invention a method is disclosed of assessing the efficacy of a therapy for inhibiting leukemia in a patient, the method comprising comparing the expression profile, preferably the level of expression of a group of markers in the first sample obtained from the patient prior to providing at least a portion of the therapy to the patient and the expression profile, preferably the level of expression of a group of markers in a second sample obtained from the patient following provision of the portion of the therapy, wherein a significantly (altered) expression profile, i.e. a significantly (altered) difference in the level of expression of the group of markers in the second sample, relative to the first sample, is an indication that the therapy is efficacious for inhibiting leukemia in the patient characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 and whereby the number of markers in the group is between one, preferably two such as 3, 4, 5, 6, 7, 8, 9 or 10 and the total number of markers listed in the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, or 42.
As with the previous embodiment, the alteration determined in accordance with the method of the invention in the expression profile or expression level must be in the direction of the expression profile of normal cells or at least diseased but non-leukemic cells. Accordingly, it is also preferred in accordance with this embodiment that the comparison includes an internal standard of expression levels of analysed markers wherein the internal standard represents the expression profile of non-leukemic and preferably normal cells. The comparison may again be effected by relying on actual experimental data or on in silico obtained reference data.
Within the therapy of the patient, compounds may be administered that have at least passed phase II and preferably are within phase III of clinical trials. Advantageously, in one embodiment, a therapeutic composition or medicinal product is administered that comprises one pharmaceutically active compound. In alternative embodiments, pharmaceutical compositions or medicinal products are administered that comprise more than one pharmaceutically active compound. If the composition or product comprises more than one pharmaceutically active compound then one of the compounds may aim at the direct reduction of tumor load whereas at least one further compound may fulfil an accessory function such as the general stimulation of the immune system. Compounds of the latter class are also well known in the art and comprise plant derived products as well as immunostimulatory molecules selected from the group of interleukins, interferons and others.
Additionally, the invention contemplates a method of refining a compound identified by the method as described herein above, said method comprising optionally the steps of said methods and:
The target may in accordance with the above be DNA, mRNA or protein. All techniques employed in the various steps of the method of the invention are conventional or can be derived by the person skilled in the art from conventional techniques without further ado. Thus, biological assays based on the herein identified nature of the proteins/(poly)peptides may be employed to assess the specificity or potency of the drugs wherein the increase of one or more activities of the proteins/(poly)peptides may be used to monitor said specificity or potency. Steps (1) and (2) can be carried out according to conventional protocols. A protocol for site directed mutagenesis is described in Ling M M, Robinson B H. (1997) Anal. Biochem. 254: 157-178. The use of homology modeling in conjunction with site-directed mutagenesis for analysis of structure-function relationships is reviewed in Szklarz and Halpert (1997) Life Sci. 61:2507-2520. Chimeric proteins are generated by ligation of the corresponding DNA fragments via a unique restriction site using the conventional cloning techniques described in Sambrook (1989), loc. cit. A fusion of two DNA fragments that results in a chimeric DNA fragment encoding a chimeric protein can also be generated using the gateway-system (Life technologies), a system that is based on DNA fusion by recombination. A prominent example of molecular modeling is the structure-based design of compounds binding to HIV reverse transcriptase that is reviewed in Mao, Sudbeck, Venkatachalam and Uckun (2000). Biochem. Pharmacol. 60: 1251-1265.
For example, identification of the binding site of said drug by site-directed mutagenesis and chimeric protein studies can be achieved by modifications in the (poly)peptide primary sequence that affect the drug affinity; this usually allows the binding pocket for the drug to be mapped precisely.
As regards step (2), the following protocols may be envisaged: Once the effector site for drugs has been mapped, the precise residues interacting with different parts of the drug can be identified by combination of the information obtained from mutagenesis studies (step (1)) and computer simulations of the structure of the binding site provided that the precise three-dimensional structure of the drug is known (if not, it can be predicted by computational simulation). If said drug is itself a peptide, it can be also mutated to determine which residues interact with other residues in the (poly)peptide of interest.
Finally, in step (3) the drug can be modified to improve its binding affinity or its potency and specificity. If, for instance, there are electrostatic interactions between a particular residue of the (poly)peptide of interest and some region of the drug molecule, the overall charge in that region can be modified to increase that particular interaction.
Identification of binding sites may be assisted by computer programs. Thus, appropriate computer programs can be used for the identification of interactive sites of a putative inhibitor and the (poly)peptide by computer assisted searches for complementary structural motifs (Fassina, Immunomethods 5 (1994), 114-120). Further appropriate computer systems for the computer aided design of protein and peptides are described in the prior art, for example, in Berry, Biochem. Soc. Trans. 22 (1994), 1033-1036; Wodak, Ann. N.Y. Acad. Sci. 501 (1987), 1-13; Pabo, Biochemistry 25 (1986), 5987-5991. Modifications of the drug can be produced, for example, by peptidomimetics and other inhibitors can also be identified by the synthesis of peptidomimetic combinatorial libraries through successive chemical modification and testing the resulting compounds. Methods for the generation and use of peptidomimetic combinatorial libraries are described in the prior art, for example in Ostresh, Methods in Enzymology 267 (1996), 220-234 and Dorner, Bioorg. Med. Chem. 4 (1996), 709-715. Furthermore, the three-dimensional and/or crystallographic structure of activators of the expression of the (poly)peptide of the invention can be used for the design of peptidomimetic activators, e.g., in combination with the (poly)peptide of the invention (Rose, Biochemistry 35 (1996), 12933-12944; Rutenber, Bioorg. Med. Chem. 4 (1996), 1545-1558).
In accordance with the above, in a preferred embodiment of the method of the invention said at least one compound is further refined by peptidomimetics.
The invention furthermore relates to a method of modifying a compound identified or refined by the method as described herein above as a lead compound to achieve (i) modified site of action, spectrum of activity, organ specificity, and/or (ii) improved potency, and/or (iii) decreased toxicity (improved therapeutic index), and/or (iv) decreased side effects, and/or (v) modified onset of therapeutic action, duration of effect, and/or (vi) modified pharmacokinetic parameters (resorption, distribution, metabolism and excretion), and/or (vii) modified physico-chemical parameters (solubility, hygroscopicity, color, taste, odor, stability, state), and/or (viii) improved general specificity, organ/tissue specificity, and/or (ix) optimized application form and route by (i) esterification of carboxyl groups, or (ii) esterification of hydroxyl groups with carbon acids, or (iii) esterification of hydroxyl groups to, e.g. phosphates, pyrophosphates or sulfates or hemi succinates, or (iv) formation of pharmaceutically acceptable salts, or (v) formation of pharmaceutically acceptable complexes, or (vi) synthesis of pharmacologically active polymers, or (vii) introduction of hydrophilic moieties, or (viii) introduction/exchange of substituents on aromates or side chains, change of substituent pattern, or (ix) modification by introduction of isosteric or bioisosteric moieties, or
(x) synthesis of homologous compounds, or (xi) introduction of branched side chains, or (xii) conversion of alkyl substituents to cyclic analogues, or (xiii) derivatisation of hydroxyl group to ketals, acetals, or (xiv) N-acetylation to amides, phenylcarbamates, or (xv) synthesis of Mannich bases, imines, or (xvi) transformation of ketones or aldehydes to Schiff's bases, oximes, acetates, ketals, enolesters, oxazolidines, thiazolidines or combinations thereof; said method optionally further comprising the steps of the above described methods.
The various steps recited above are generally known in the art. They include or rely on quantitative structure-activity relationship (QSAR) analyses (Kubinyi, “Hansch-Analysis and Related Approaches”, VCH Verlag, Weinheim, 1992), combinatorial biochemistry, classical chemistry and others (see, for example, Holzgrabe and Bechtold, Deutsche Apotheker Zeitung 140(8), 813-823, 2000).
The invention moreover relates to a method of producing a pharmaceutical composition comprising optionally the steps of the aforementioned methods and further the step of formulating the at least one compound identified, refined or modified by the method of any of the preceding embodiments with a pharmaceutically acceptable carrier or diluent.
The pharmaceutical composition produced in accordance with the present invention may further comprise a pharmaceutically acceptable carrier and/or diluent and/or excipient. Examples of suitable pharmaceutical carriers are well known in the art and include phosphate buffered saline solutions, water, emulsions, such as oil/water emulsions, various types of wetting agents, sterile solutions etc. Compositions comprising such carriers can be formulated by well known conventional methods. These pharmaceutical compositions can be administered to the subject at a suitable dose. Administration of the suitable compositions may be effected in different ways, e.g., by intravenous, intraperitoneal, subcutaneous, intramuscular, topical, intradermal, intranasal or intrabronchial administration. The dosage regimen will be determined by the attending physician and clinical factors. As is well known in the medical arts, dosages for any one patient depend upon many factors, including the patient's size, body surface area, age, the particular compound to be administered, sex, time and route of administration, general health, and other drugs being administered concurrently. A typical dose can be, for example, in the range of 0.001 to 1000 μg (or of nucleic acid for expression or for inhibition of expression in this range); however, doses below or above this exemplary range are envisioned, especially considering the aforementioned factors. Generally, the regimen as a regular administration of the pharmaceutical composition should be in the range of 1 μg to 10 mg units per day. If the regimen is a continuous infusion, it should also be in the range of 1 μg to 10 mg units per kilogram of body weight per minute, respectively. Progress can be monitored by periodic assessment. Dosages will vary but a preferred dosage for intravenous administration of DNA is from approximately 10^6 to 10^12 copies of the DNA molecule.
The compositions of the invention may be administered locally or systemically. Administration will generally be parenteral, e.g., intravenous; DNA may also be administered directly to the target site, e.g., by biolistic delivery to an internal or external target site or by catheter to a site in an artery. Preparations for parenteral administration include sterile aqueous or non-aqueous solutions, suspensions, and emulsions. Examples of non-aqueous solvents are propylene glycol, polyethylene glycol, vegetable oils such as olive oil, and injectable organic esters such as ethyl oleate. Aqueous carriers include water, alcoholic/aqueous solutions, emulsions or suspensions, including saline and buffered media. Parenteral vehicles include sodium chloride solution, Ringer's dextrose, dextrose and sodium chloride, lactated Ringer's, or fixed oils. Intravenous vehicles include fluid and nutrient replenishers, electrolyte replenishers (such as those based on Ringer's dextrose), and the like. Preservatives and other additives may also be present such as, for example, antimicrobials, anti-oxidants, chelating agents, and inert gases and the like. Furthermore, the pharmaceutical composition of the invention may comprise further agents such as interleukins or interferons depending on the exact intended use of the pharmaceutical composition.
The above methods referring to downstream developments also apply to therapeutically effective compounds referred to in additional embodiments herein below.
In another embodiment of the invention a method is disclosed of selecting a composition for inhibiting leukemia in a patient, the method comprising separately maintaining aliquots of cells of a patient sample in the presence of a plurality of test compositions, comparing the expression profile, preferably the level of expression of a group of markers in each of the aliquots, and selecting one of the test compositions which induces an altered expression profile of the group of markers in the aliquot containing that test composition, relative to other test compositions characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 and whereby the number of markers in the group is between one, preferably two such as 3, 4, 5, 6, 7, 8, 9 or 10 and the total number of markers listed in the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42.
Again, as with the previously recited embodiments, the alteration determined in accordance with the method of the invention in the expression profile or expression level must be in the direction of the expression profile of normal cells or at least diseased but non-leukemic cells. Accordingly, it is also preferred in accordance with this embodiment that the comparison includes an internal standard of expression levels of analysed markers wherein the internal standard represents the expression profile of non-leukemic and preferably normal cells. The comparison may—again—be effected by relying on actual experimental data or on in silico obtained reference data.
The expression “in the direction of the expression profile of normal cells” as used herein preferably relates to cells that comprise blood cells, more preferably a single type of blood cells. Most preferably, the single type of cells corresponds to the type of the leukemic cell. For example, an AML type of leukemic cell would preferably be compared to a healthy myeloid blast cell whereas an ALL type of leukemic cell would preferably be compared to a healthy lymphatic blast cell. Myeloid blast cells and lymphatic blast cells may be isolated from healthy bone marrow using well known methods, such as cell sorting based on flow cytometry using established cell surface markers.
In this method of the invention, it is preferred that the test composition comprises only one putatively active test compound. In that case, correlating the activity of the test compound with the readout is particularly convenient. If the test composition comprises more than one putatively pharmaceutically active compound, it may be considered to separately test each compound in a composition that has tested positive in a first round of the assay. Consequently, in a second round, i.e. in a repetition of steps (a) and (b), the various compositions tested positive, if any, in the first round, may be subdivided into single compounds and these single compounds tested again for their efficacy. The goal of such an approach, of course, is to obtain a composition comprising a single active compound only.
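The selection among test compositions described for this embodiment can be sketched as follows. The sketch assumes, for illustration only, that the composition of choice is the one whose aliquot ends up closest (in Euclidean distance) to a reference profile of normal cells; the composition names, marker names and values are hypothetical.

```python
import math

def distance_to_reference(profile, reference, markers):
    """Euclidean distance between an aliquot profile and the reference profile."""
    return math.sqrt(sum((profile[m] - reference[m]) ** 2 for m in markers))

def select_composition(aliquot_profiles, reference, markers):
    """Return the test composition whose aliquot is closest to the reference profile."""
    return min(
        aliquot_profiles,
        key=lambda name: distance_to_reference(aliquot_profiles[name], reference, markers),
    )

# Hypothetical expression levels measured in separately maintained aliquots.
markers = ["marker_1", "marker_2"]
normal_reference = {"marker_1": 5.0, "marker_2": 5.0}
aliquot_profiles = {
    "composition_1": {"marker_1": 8.5, "marker_2": 2.5},
    "composition_2": {"marker_1": 5.5, "marker_2": 4.5},
    "composition_3": {"marker_1": 7.0, "marker_2": 3.0},
}
selected = select_composition(aliquot_profiles, normal_reference, markers)
```

In a second round, the same function could be applied to aliquots treated with the single compounds of the selected composition.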
In another embodiment a method of determining new subtypes of leukemia cells is disclosed, the method comprising determining the expression profile, preferably the level of expression of a group of markers of leukemia cells of unknown subtype, comparing the expression profile to the level of expression, i.e. the expression profile, of a group of markers of leukemia cells of known subtype, thereby concluding that a new subtype is determined when the expression profile, preferably the level of expression is different to all known subtypes characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 and whereby the number of markers in the group is between one, preferably two and the total number of markers listed in the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42.
The term “subtype of leukemia cells” in accordance with the present invention may be better understood from the following: Leukemias are subdivided according to their natural clinical course into acute and chronic leukemias. Based on the cell line they are derived from they are further subdivided into myeloid and lymphatic leukemias. This results in four leukemia types, i.e. acute myeloid leukemia (AML), acute lymphoblastic leukemia (ALL), chronic myeloid leukemia (CML), and chronic lymphatic leukemia (CLL). Based on genetic, phenotypic, and biological characteristics, which are assessed by cytomorphology, cytochemistry, cytogenetics, immunophenotyping, and molecular genetics, AML, ALL, and CLL are further subdivided into subtypes. These subtypes are associated with highly differing prognoses. Treatment approaches specific for these subtypes are applied and are being further optimized. Thus, an exact diagnosis based on a reliable and reproducible method is essential for the selection of the appropriate subtype-specific treatment.
The new subtypes identified in accordance with the invention may then be subjected in the same or in further patients to the other methods/embodiments of the invention.
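The conclusion that an expression profile "is different to all known subtypes" can be sketched as a distance check against every known reference. The distance measure and the cut-off are assumptions made for illustration; the specification does not fix a numerical criterion, and all names and values below are hypothetical.

```python
import math

def distance(profile_a, profile_b, markers):
    """Euclidean distance between two expression profiles over the marker group."""
    return math.sqrt(sum((profile_a[m] - profile_b[m]) ** 2 for m in markers))

def differs_from_all_known(sample, known_subtypes, markers, threshold):
    """True if the sample lies farther than `threshold` from every known subtype profile.

    `threshold` is a hypothetical cut-off that would have to be calibrated
    on samples of known subtype.
    """
    return all(
        distance(sample, reference, markers) > threshold
        for reference in known_subtypes.values()
    )

markers = ["marker_1", "marker_2"]
known_subtypes = {
    "subtype_A": {"marker_1": 9.0, "marker_2": 2.0},
    "subtype_B": {"marker_1": 3.0, "marker_2": 8.0},
}
unknown_sample = {"marker_1": 1.0, "marker_2": 1.0}
familiar_sample = {"marker_1": 8.5, "marker_2": 2.5}

candidate_new_subtype = differs_from_all_known(unknown_sample, known_subtypes, markers, threshold=2.0)
matches_known = not differs_from_all_known(familiar_sample, known_subtypes, markers, threshold=2.0)
```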
In another embodiment a method is disclosed for guiding the therapy of leukemia in a patient depending on the leukemia subtype and/or the risk of relapse of disease, the method comprising determining the expression profile, preferably the level of expression of a group of markers in the patient sample, and deciding about the therapy strategy depending on the leukemia subtype or the risk of relapse of disease characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 and whereby the number of markers in the group is between one, preferably two such as 3, 4, 5, 6, 7, 8, 9 or 10 and the total number of markers listed in the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42.
This embodiment is particularly important for the quick and reliable recovery of the patient from the leukemia that affects him or her. As has been stated above, the early and reliable diagnosis of the leukemia type or subtype is particularly important for the instigation of a useful and straightforward treatment regimen. An incorrect diagnosis may result in the application of a wrong treatment regimen which, in turn, may lead to significant health risks including premature death of the patient. In accordance with the present invention, a reliable means has been provided that, based on the inventive selection of markers provided, will overcome the prior art problems of an unreliable or unduly time-consuming diagnosis. In particular, the present method of the invention provides in step (a) an unambiguous and safe basis for the decision step (b). Again, the patient may safely rely on the conclusion drawn in step (b) due to the strong inherent correlation that has been achieved between the selection of markers and the leukemia subtype. The relation of tables to leukemia subtypes has also been demonstrated elsewhere in this specification.
In another embodiment of the invention, a method for monitoring the progression of leukemia in a patient is disclosed, the method comprising determining the expression profile, preferably the level of expression of a group of markers in a patient sample at a first point in time, and repeating this step at a subsequent point in time; and comparing the expression profile, preferably the level of expression detected in the previous steps and therefrom monitoring the progression of leukemia in the patient, characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 and whereby the number of markers in the group is between one, preferably two such as 3, 4, 5, 6, 7, 8, 9 or 10 and the total number of markers listed in the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42. In a preferred embodiment, the patient has undergone chemotherapy between the first point in time and the subsequent point in time (including repetitions of step (b)).
In this embodiment of the present invention, the skilled artisan may repeat step (b) one or more times in order to collect additional data from different (more) time points. The additional data obtained by such further measurements may provide an overall better overview on the progress of the disease.
In accordance with this embodiment of the invention, the term “progression of leukemia” includes the interpretation of “regression of leukemia”, i.e. includes the interpretation of a negative progression. This is of course in line with the aim of the therapy and the desire of the patient.
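Monitoring over repeated time points, including the reading of a negative progression as regression, can be sketched as follows. As an illustrative assumption, disease status is summarized as the Euclidean distance of each profile from a non-leukemic reference, and the trend compares the first and last time points only; the labels, marker names and values are hypothetical.

```python
import math

def distance(profile, reference, markers):
    """Euclidean distance between a profile and the reference over the marker group."""
    return math.sqrt(sum((profile[m] - reference[m]) ** 2 for m in markers))

def progression_series(time_points, reference, markers):
    """Distance from the reference at each time point, in chronological order."""
    return [(label, distance(profile, reference, markers)) for label, profile in time_points]

def trend(series):
    """Compare the last time point with the first one."""
    first, last = series[0][1], series[-1][1]
    if last < first:
        return "regression"
    if last > first:
        return "progression"
    return "stable"

markers = ["marker_1", "marker_2"]
normal_reference = {"marker_1": 5.0, "marker_2": 5.0}
time_points = [
    ("first point in time", {"marker_1": 9.0, "marker_2": 2.0}),
    ("second point in time", {"marker_1": 7.0, "marker_2": 3.5}),
    ("third point in time", {"marker_1": 5.5, "marker_2": 4.5}),
]
series = progression_series(time_points, normal_reference, markers)
overall = trend(series)
```

Additional time points simply extend `time_points`, which corresponds to repeating the measurement step one or more times.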
In the methods according to the invention it is preferred that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 and whereby the number of markers in the group is between one, preferably two and the total number of markers listed in the at least one of tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42. In a preferred embodiment, the number of markers in the group is between five, more preferably between 7, 10 or 15 and the total number of markers listed in the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42. It is feasible that the group of markers not only consists of those markers but also comprises them as the data will then be still statistically significant, i.e. the preferred groups may additionally contain 10, 50 or 100 other markers and comprise the other markers according to the invention and mentioned above. It is, however, also feasible for the expert skilled in the art that only a single suitable marker is determined with the methods according to the invention.
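The constraint on the size of the marker group (at least a minimum number of markers, at most the total number listed in the chosen tables, with every marker drawn from those tables) can be expressed as a simple validation step. The table contents below are hypothetical placeholders, not the markers of the actual tables.

```python
def validate_marker_group(group, tables, minimum=1):
    """Check that a marker group is drawn from the given tables and has an admissible size.

    `tables` maps a table name to the markers it lists; `minimum` is the lower
    bound on the group size (one, preferably two or more).
    """
    available = set().union(*tables.values())
    unique = set(group)
    unknown = unique - available
    if unknown:
        raise ValueError(f"markers not listed in the selected tables: {sorted(unknown)}")
    if not minimum <= len(unique) <= len(available):
        raise ValueError(
            f"group size {len(unique)} outside the range {minimum} to {len(available)}"
        )
    return sorted(unique)

# Hypothetical table contents for illustration only.
tables = {
    "table_X": ["marker_1", "marker_2", "marker_3"],
    "table_Y": ["marker_3", "marker_4"],
}
validated = validate_marker_group(["marker_2", "marker_4"], tables, minimum=2)
```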
Particularly preferred markers used in a method where only one or a few as e.g. one, preferably two markers are used are described in Table 22 and Example 3,
Particularly preferred markers used in a method where only one or a few as e.g. one, preferably two markers are used are described in tables 30, 33, 36 and 42 and Example 7, FIGS. 189 to 234, 254 to 272, 338 to 371, 433 to 465, respectively, or the markers marked with an asterisk in tables 29, 32, 35, 38, and 41 and FIGS. 24 to 188, 235 to 253, 273 to 337, 372 to 405, 406 to 432, respectively as the preferred set of markers. In detail, example 7 mentions (see example 7 for more details) the following markers including their expression level:
Preferred methods for detection and quantification of the amount of nucleic acids, i.e. for the methods according to the invention allowing the determination of the level of expression of a marker or a group of markers, are those described by Sambrook et al. (1989) or real time methods known in the art such as the TaqMan® method disclosed in WO92/02638 and the corresponding U.S. Pat. No. 5,210,015, U.S. Pat. No. 5,804,375, U.S. Pat. No. 5,487,972. This method exploits the exonuclease activity of a polymerase to generate a signal. In detail, the (at least one) target nucleic acid component is detected by a process comprising contacting the sample with an oligonucleotide containing a sequence complementary to a region of the target nucleic acid component and a labeled oligonucleotide containing a sequence complementary to a second region of the same target nucleic acid component sequence strand, but not including the nucleic acid sequence defined by the first oligonucleotide, to create a mixture of duplexes under hybridization conditions, wherein the duplexes comprise the target nucleic acid annealed to the first oligonucleotide and to the labeled oligonucleotide such that the 3′-end of the first oligonucleotide is adjacent to the 5′-end of the labeled oligonucleotide. Then this mixture is treated with a template-dependent nucleic acid polymerase having a 5′ to 3′ nuclease activity under conditions sufficient to permit the 5′ to 3′ nuclease activity of the polymerase to cleave the annealed, labeled oligonucleotide and release labeled fragments. The signal generated by the hydrolysis of the labeled oligonucleotide is detected and/or measured. TaqMan® technology eliminates the need for a solid phase bound reaction complex to be formed and made detectable. Other methods include e.g. fluorescence resonance energy transfer between two adjacently hybridized probes as used in the LightCycler® format described in U.S. Pat. No. 6,174,670.
Protocols for carrying out the methods according to the invention are known to the expert in the field and are described in the examples, preferably in example 1 and 4. A preferred protocol is described in Example 1 (A), where total RNA is isolated, cDNA synthesized and biotin incorporated during the transcription reaction. The purified cDNA is applied to commercially available arrays which can be obtained e.g. from Affymetrix. The hybridized cDNA is detected according to the methods described in Example 1 (A). The arrays are produced by photolithography or other methods known to experts skilled in the art e.g. from U.S. Pat. No. 5,445,934, U.S. Pat. No. 5,744,305, U.S. Pat. No. 5,700,637, U.S. Pat. No. 5,945,334 and EP 619 321 or EP 373 203. The latter methods are also suitable for producing the composition according to the invention, in particular the composition wherein polynucleotides or oligonucleotides are bound to a solid phase, in particular in the form of arrays. In a further preferred embodiment of the methods according to the invention, a transcribed polynucleotide or portion thereof is the marker or at least one of the markers. A particularly preferred transcribed polynucleotide is an mRNA or a cDNA. In a preferred embodiment of the methods according to the invention, the step of determining the expression profile further comprises amplifying the transcribed polynucleotide. In another preferred embodiment, the level of expression, i.e. the expression profile, of the group of transcribed polynucleotides is determined by annealing the transcribed polynucleotides with a complementary polynucleotide or a portion thereof under stringent hybridization conditions. The term “stringent hybridization conditions” is equivalent to the term “highly stringent hybridization conditions”.
Such highly stringent hybridization conditions may be determined in accordance with the teachings provided in Hames and Higgins (eds) “Nucleic acid hybridization, a practical approach”, IRL Press 1985, Oxford, and include hybridization at 55-65° C. in 0.2-0.5×SSC, 0.1% SDS followed by appropriate washing conditions such as 0.5-1×SSC at 55° C. and 0.1% SDS.
In a most preferred embodiment, the patient sample is blood, i.e. blood mononuclear cells, or bone marrow, i.e. mononuclear cells. The methods according to the invention may be performed on fresh or frozen blood, i.e. blood mononuclear cells or bone marrow, i.e. mononuclear cells.
In a preferred embodiment the marker or at least one of the markers is a protein. In another preferred embodiment the expression profile of the proteins is detected using a reagent which specifically binds to one of the proteins whereby preferably the reagent is selected from the group consisting of an antibody, an antibody derivative, and an antibody fragment.
The term “antibody” comprises monoclonal antibodies as first described by Köhler and Milstein in Nature 256 (1975), 495-497 as well as polyclonal antibodies, i.e. antibodies contained in a polyclonal antiserum. Monoclonal antibodies include those produced by transgenic mice. Fragments of antibodies include F(ab′)2, Fab and Fv fragments. Derivatives of antibodies include scFvs, chimeric and humanized antibodies. See, for example Harlow and Lane, loc. cit.
Another embodiment of the invention is a kit preferably for assessing the suitability of each of a plurality of compounds for inhibiting leukemia in a patient, the kit optionally comprising the plurality of compounds; and a reagent for assessing the expression profile of a group of markers characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 and whereby the number of markers in the group is between two and the total number of markers listed in the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42. Another embodiment is a kit preferably for assessing whether a patient is afflicted with leukemia, the kit comprising reagents for assessing the expression profile of a group of markers characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 and whereby the number of markers in the group is between two and the total number of markers listed in the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42. Another embodiment is a kit preferably for assessing the presence of human leukemia cells, the kit comprising an antibody, wherein the antibody specifically binds with a protein corresponding to a marker characterized in that the marker is selected from the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42. Another embodiment is a kit preferably for assessing the leukemia cell carcinogenic potential of a test compound, the kit comprising leukemia cells and a reagent for assessing expression of a marker, wherein the marker is selected from the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42.
Advantageously, the kit of the present invention further comprises, optionally (a) storage solution(s) and/or remaining reagents or materials required for the conduct of scientific and/or diagnostic and/or therapeutic methods. Furthermore, parts of the kit of the invention can be packaged individually in vials or bottles or in combination in containers or multicontainer units.
Another embodiment of the invention is related to a protein or the RNA, cDNA or cRNA corresponding to a marker selected from the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 or the use thereof for the treatment of or vaccination against leukemia. Alternatively and depending on the exact purpose, inhibitors of these compounds such as antibodies, fragments or derivatives thereof may be employed for said purpose.
The invention also contemplates a method for the development or preparation of a pharmaceutical composition for the treatment of leukemia characterized in that a protein corresponding to a marker selected from the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 is admixed with pharmaceutical compounds. Another embodiment of the invention is related to a method for the development or preparation of a pharmaceutical composition for the treatment of leukemia characterized in that a vector comprising a polynucleotide encoding a protein corresponding to a marker selected from the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 is admixed with pharmaceutical compounds. Another embodiment of the invention is a method for the development or preparation of a pharmaceutical composition for the treatment of leukemia characterized in that an antisense oligonucleotide complementary to a polynucleotide encoding a protein corresponding to a marker selected from the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 is admixed with pharmaceutical compounds. Alternatively, inhibitors such as antibodies specific for the markers may be used for the preparation or development of a pharmaceutical composition.
The term “pharmaceutical compounds” is preferably to be understood to mean pharmaceutically acceptable carriers, diluents or excipients, only in connection with the embodiments recited in this paragraph. In another embodiment of the invention a marker or a group of markers selected individually from one or more of the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 is used for the determination of leukemia cells, the type or subtype of leukemia cells.
In another embodiment of the invention a marker or a group of markers selected individually from one or more of the tables 1, 2, 13, 14, 17, 25, 27, 35 or 36 is used for the determination of the subtype of AML cells.
In a preferred embodiment, the invention is related to a composition comprising a group of markers and substances chemically different to the markers characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42 and whereby the number of markers in the group is between one, preferably two, and the total number of markers listed in the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42. It is preferred that the composition according to the invention is characterized in that the group of markers consists of all markers listed in one or more of the tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42. More preferably, the composition according to the invention is characterized in that the group of markers consists of all markers listed in one or more of the tables 14, tables 16 to 20, or table 29 or 30; most preferably, the group of markers consists of all markers listed in the tables 16 to 20 or tables 29 or 30. Preferably the markers are polynucleotides or oligonucleotides, whereby the polynucleotides are bound to a solid phase in the form of an array.
The present invention also relates to a method of determining the subtypes of ALL cells in a patient sample comprising the steps of a) determining the level of expression of a group of markers in the patient sample and b) concluding from the differences in the level of expression which subtypes of ALL cells the patient sample contains characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 18, 32 or 33 and whereby the number of markers in the group is between two and the total number of markers listed in the tables 18, 32 or 33.
Preferably the group of markers consists of all markers listed in one or more of the tables 18, 32 or 33.
The present invention further relates to a method of determining the subtypes of CLL cells in a patient sample comprising the steps of a) determining the level of expression of a group of markers in the patient sample and b) concluding from the differences in the level of expression which subtypes of CLL cells the patient sample contains characterized in that the group of markers consists of markers selected independently from the markers listed in one or more of the tables 38 or 39 and whereby the number of markers in the group is between two and the total number of markers listed in the tables 38 or 39.
It is preferred that the group of markers consists of all markers listed in one or more of the tables 38 or 39.
The present invention is also related to a diagnostic composition comprising at least one nucleic acid molecule, preferably (a) single-stranded nucleic acid molecule(s), which is capable of specifically hybridizing to the mRNA of at least one gene listed in Table 1. The use of said nucleic acid molecules for diagnosis of leukemia subtypes, preferably based on microarray technology, offers the following advantages: (1) more rapid and more precise diagnosis, (2) easy to use in laboratories without specialized experience, (3) abolishes the requirement for analyzing viable cells for chromosome analysis (transport problem), (4) very experienced hematologists for cytomorphology and cytochemistry, immunophenotyping as well as cytogeneticists and molecular biologists are no longer required, and (5) improves the subclassification of leukemia due to the definition of new entities based on gene expression profiles in those subtypes that are not clearly defined with the methods of the prior art (class discovery).
As used herein, the term “capable of specifically hybridizing” has the meaning of hybridization under conventional hybridization conditions, preferably under stringent conditions as described, for example, in Sambrook, J., et al., in “Molecular Cloning: A Laboratory Manual” (1989), Eds. J. Sambrook, E. F. Fritsch and T. Maniatis, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY and the further definitions provided above. Also contemplated are nucleic acid molecules that hybridize at lower stringency hybridization conditions. Changes in the stringency of hybridization and signal detection are primarily accomplished through the manipulation, preferably of formamide concentration (lower percentages of formamide result in lowered stringency), salt conditions, or temperature. For example, lower stringency conditions include an overnight incubation at 37° C. in a solution comprising 6×SSPE (20×SSPE=3M NaCl; 0.2M NaH2PO4; 0.02M EDTA, pH 7.4), 0.5% SDS, 30% formamide, 100 μg/ml salmon sperm blocking DNA, followed by washes at 50° C. with 1×SSPE, 0.1% SDS. In addition, to achieve even lower stringency, washes performed following stringent hybridization can be done at higher salt concentrations (e.g. 5×SSC). Variations in the above conditions may be accomplished through the inclusion and/or substitution of alternate blocking reagents used to suppress background in hybridization experiments. The inclusion of specific blocking reagents may require modification of the hybridization conditions described above, due to problems with compatibility.
As a hybridization probe (or primer), nucleic acid molecules can be used, for example, that have exactly or basically the nucleotide sequence of at least one of the genes depicted in the appended tables or parts of these sequences. The term nucleic acid molecule as used herein also comprises fragments which are understood to be parts of the nucleic acid molecules that are long enough to specifically hybridize to transcripts of at least one of the genes of the appended tables. These nucleic acid molecules can be used, for example, as probes or primers in a diagnostic assay. Preferably, the nucleic acid molecules of the present invention have a length of at least 8, 10, 12, 13, 15, 18, in particular of at least 20 and particularly preferred of at least 25 nucleotides. The nucleic acid molecules of the invention or parts thereof can also be used, for example, as primers for a PCR reaction. The fragments used as hybridization probes can be synthetic fragments that were produced by means of conventional synthesis methods.
In a preferred embodiment, the diagnostic composition of the present invention comprises at least nucleic acid molecules which are capable of specifically hybridizing to the mRNAs of at least one of the genes listed in the appended tables, preferably 2-5, more preferably 8-12 genes.
In a more preferred embodiment, the diagnostic composition of the present invention comprises at least nucleic acid molecules which are capable of specifically hybridizing to the mRNAs of at least one of the genes listed in the appended tables. In a further preferred embodiment, the diagnostic composition of the present invention comprises at least nucleic acid molecules which are capable of specifically hybridizing to the mRNAs of all genes listed in the appended tables.
In a further preferred embodiment, the nucleic acid molecules of the diagnostic composition of the present invention are bound to (a) a solid support, for example, a polystyrene microtiter dish or nitrocellulose membrane or glass surface or (b) to non-immobilized particles in solution.
In an even more preferred embodiment, the nucleic acid molecules of the diagnostic composition are present in a microarray format which can be established according to well known methods; for details see, e.g., www.affymetrix.com/technology/tech_spotted.html; www.affymetrix.com/technology/tech_probe.html.
The present invention also provides the use of (a) nucleic acid molecule(s) of the present invention for the preparation of a diagnostic composition for the diagnosis of a leukemia or for the diagnosis of several subtypes or a disposition to a leukemia. For the diagnosis of a particular leukemia subtype, preferably, at least 5 different nucleic acid molecules are used as probes. For diagnosis, preferably, bone marrow or peripheral blood can be used. For diagnosis, the target sample is contacted with (a) nucleic acid molecule(s) of the present invention and the concentration of individual mRNAs is compared with the mRNA expression profile levels of a test sample obtained from healthy donors.
It is a further embodiment of the invention to provide a method of determining whether a patient sample contains leukemia cells or other cells and at the same time determining the type and subtype of leukemia cells, comprising the steps of providing a patient sample, isolating RNA from the patient sample, transcribing the RNA into cDNA and transcribing the cDNA into cRNA while simultaneously labelling the cRNA, hybridising the cRNA to a microarray, and determining the level of expression of a marker or a group of markers.
Further, the invention contemplates the use of a marker or a group of markers for determining whether a patient sample contains leukemia cells or other cells, whereby preferably the type and subtype of leukemia cells is simultaneously or subsequently determined. The markers specified in the appended examples and tables may, in accordance with the invention, be used to differentiate, for example, between ALL, CLL, CML and AML.
The nucleic acid molecule is typically a nucleic acid probe for hybridization or a primer for PCR. The person skilled in the art is in a position to design suitable nucleic acid probes based on the information provided in the appended tables.
The target cellular component, i.e. mRNA, e.g. in bone marrow (BM) or blood, may be detected directly in situ, e.g. by in situ hybridization, or it may be isolated from other cell components by common methods known to those skilled in the art before contacting with a probe. Detection methods include Northern blot analysis, RNase protection, in situ methods, e.g. in situ hybridization, in vitro amplification methods (PCR, LCR, QRNA replicase or RNA-transcription/amplification (TAS, 3SR), reverse dot blot disclosed in EP 0 237 362) and other detection assays that are known to those skilled in the art. Preferably, detection is based on a microarray.
Amplification methods include the polymerase chain reaction (PCR) which specifically amplifies target sequences to detectable amounts. Other possible amplification reactions are the Ligase Chain Reaction (LCR, Wu and Wallace, 1989, Genomics 4:560-569 and Barany, 1991, Proc. Natl. Acad. Sci. USA 88:189-193); Polymerase Ligase Chain Reaction (Barany, 1991, PCR Methods and Applic. 1:5-16); Gap-LCR (PCT Patent Publication No. WO 90/01069); Repair Chain Reaction (European Patent Publication No. 439,182 A2), 3SR (Kwoh et al., 1989, Proc. Natl. Acad. Sci. USA 86:1173-1177; Guatelli et al., 1990, Proc. Natl. Acad. Sci. USA 87:1874-1878; PCT Patent Publication No. WO 92/0880A), and NASBA (U.S. Pat. No. 5,130,238). Further, there are strand displacement amplification (SDA), transcription mediated amplification (TMA), and Qβ-amplification (for a review see e.g. Whelen and Persing (1996). Annu. Rev. Microbiol. 50, 349-373; Abramson and Myers, 1993, Current Opinion in Biotechnology 4:41-47).
Products obtained by in vitro amplification can be detected according to established methods, e.g. by separating the products on agarose gels and by subsequent staining with ethidium bromide. Alternatively, the amplified products can be detected by using labeled primers for amplification or labeled dNTPs.
The probes can be detectably labeled, for example, with a radioisotope, a bioluminescent compound, a chemiluminescent compound, a fluorescent compound, a metal chelate, biotin or an enzyme.
The invention further contemplates a method of making an isolated hybridoma which produces an antibody useful for assessing whether a patient is afflicted with leukemia, the method comprising isolating a protein corresponding to a marker selected from the group consisting of the markers listed in Tables 1 to 20, tables 25 or 27 or tables 29, 30, 32, 33, 35, 36, 38, 39, 41, 42; immunizing a mammal using the isolated protein, or a peptide corresponding to its sequence or a part thereof; isolating splenocytes from the immunized mammal; fusing the isolated splenocytes with an immortalized cell line to form hybridomas; and screening individual hybridomas for production of an antibody which specifically binds with the protein to isolate the hybridoma. Further, an antibody produced by this method is contemplated by the invention. The antibody may be fragmented or derivatized to obtain fragments or derivatives retaining the antibody specificity as has been described herein above.
The invention further contemplates a method of assessing the leukemia cell carcinogenic potential of a test compound, the method comprising maintaining separate aliquots of leukemia cells in the presence and absence of the test compound; and comparing expression of a marker in each of the aliquots, wherein a significantly altered level of expression of the marker in the aliquot maintained in the presence of the test compound, relative to the aliquot maintained in the absence of the test compound, is an indication that the test compound possesses leukemia cell carcinogenic potential, and wherein a marker according to the invention is used.
The invention further contemplates a system for identifying selected polynucleotide records that identify a leukemia cell, the system comprising: a digital computer; a database server coupled to the computer; a database coupled to the database server having data stored therein, the data comprising records of data comprising a polynucleotide corresponding to a marker according to the invention; and a code mechanism for applying queries based upon a desired selection criteria to the data file in the database to produce reports of polynucleotide records which match the desired selection criteria.
The invention also relates to a method for detecting a leukemia cell, using a computer having a processor, memory, display, and input/output devices, the method comprising the steps of
a) providing a sequence of a polynucleotide isolated from a sample suspected of containing a leukemia cell,
b) providing a database comprising records of data comprising a polynucleotide corresponding to a group of markers according to the invention;
c) using a code mechanism for applying queries based upon a desired selection criteria to the data file in the database to produce reports of polynucleotide records of step a) which provide a match of the desired selection criteria of the sequences in the database of step b), the presence of a match being a positive indication that the polynucleotide of step a) has been isolated from a cell that is a leukemia cell.
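Steps a) to c) can be illustrated by a minimal sketch. It assumes that a "match" means the query shares a contiguous stretch of at least 20 bases with a stored marker sequence; this criterion, the `MarkerRecord` type and the function name are hypothetical and serve only to illustrate one possible selection criterion, not the code mechanism of the invention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarkerRecord:
    marker_id: str   # hypothetical identifier, e.g. an accession number
    sequence: str    # polynucleotide sequence stored for the marker


def find_matching_records(query, database, min_overlap=20):
    """Return the records sharing at least `min_overlap` contiguous bases with `query`.

    The exact-substring criterion is an assumption made for this sketch.
    """
    query = query.upper()
    hits = []
    for record in database:
        target = record.sequence.upper()
        # Slide a window of min_overlap bases along the query sequence.
        if any(query[i:i + min_overlap] in target
               for i in range(len(query) - min_overlap + 1)):
            hits.append(record)
    return hits
```

A non-empty result would correspond to the "positive indication" of step c); a real system would more likely use an alignment tool than exact substring matching.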
Also, the present invention relates to a method for assessing the leukemia cell carcinogenic potential of a test compound, comprising (a) contacting a non-leukemia cell with a test compound, and (b) assessing an increase or decrease of marker expression in said non-leukemia cell wherein the marker is selected from the tables 1 to 20, 25 or 27, 29, 30, 32, 33, 35, 36, 38, 39, 41 or 42.
The assessment may be effected on the nucleic acid level such as by hybridization techniques or PCR or on the protein level such as by using antibody- or aptamer-based technologies.
Finally, the invention relates to a diagnostic composition comprising at least one nucleic acid molecule which is capable of specifically hybridizing to the mRNA corresponding to the marker gene of any of the appended tables. The nucleic acid molecule may be an antisense DNA or RNA, an RNAi molecule, an siRNA molecule or a similar inhibitory molecule capable of specifically blocking transcription and/or translation and/or modification and/or localization of the RNA and/or protein corresponding to the marker gene.
The nucleic acid may also be a sense-strand nucleic acid e.g. RNA or preferably DNA which is capable of expressing the protein product of the marker gene, or a protein product of substantially similar activity, in a target cell into which it is introduced.
The invention further comprises pharmaceutical compositions comprising a compound capable of specifically binding to a protein or RNA corresponding to a marker of the invention as listed in any of the appended tables. The marker is preferably selected from the markers designated as particularly preferred markers as described herein above. The compound is preferably a compound capable of inhibiting or increasing the function of the protein or of enhancing or decreasing translation of the RNA. The compound is preferably selected from aptamers, aptazymes, RNAzymes, antibodies, affibodies, trinectins, anticalins, or similar compounds. The effect of the compounds on the RNA may be tested by assaying for increased/decreased synthesis of the corresponding protein. The effect of the compounds on the protein may be assayed by testing the effect of the compound in an assay of the protein's function, which e.g. may be an enzymatic function. Alternatively, the effect may be tested by contacting a leukemic cell that expresses large amounts of such protein with the compound and assaying cellular parameters associated with the leukemic state of the cell, such as cell growth, growth factor dependency and/or differentiation state of the cell.
In a further embodiment, the invention provides a method of determining whether a patient sample contains leukemia cells or other cells comprising the steps of
The following examples, references, sequence listing and figures are provided to aid the understanding of the present invention, the true scope of which is set forth in the appended claims. It is understood that modifications can be made in the procedures set forth without departing from the spirit of the invention.
Bone marrow (BM) aspirates were taken at the time of the initial diagnostic biopsy, and remaining material was immediately lysed in RLT buffer (Qiagen), frozen and stored at −80° C. until preparation for gene expression analysis. For microarray analysis the GeneChip System (Affymetrix, Santa Clara, Calif., USA) was used. The targets for GeneChip analysis were prepared according to the current Expression Analysis Technical Manual. Briefly, frozen lysates of the leukemia samples were thawed, homogenized (QIAshredder, Qiagen) and total RNA extracted (RNeasy Mini Kit, Qiagen). Normally 10 μg total RNA isolated from 1×10⁷ cells was used as starting material in the subsequent cDNA synthesis using Oligo-dT-T7-Promotor Primer (cDNA synthesis Kit, Roche Molecular Biochemicals). The cDNA was purified by phenol-chloroform extraction and precipitated with 100% ethanol overnight. For detection of the hybridized target nucleic acid, biotin-labeled ribonucleotides were incorporated during the in vitro transcription reaction (Enzo® BioArray™ HighYield™ RNA Transcript Labeling Kit, ENZO). After quantification of the purified cRNA (RNeasy Mini Kit, Qiagen), 15 μg were fragmented by alkaline treatment (200 mM Tris-acetate, pH 8.2, 500 mM potassium acetate, 150 mM magnesium acetate) and added to the hybridization cocktail sufficient for 5 hybridizations on standard GeneChip microarrays. Before expression profiling, Test3 Probe Arrays (Affymetrix) were chosen for monitoring of the integrity of the cRNA. Only labeled cRNA cocktails which showed a ratio of the measured intensity of the 3′ to the 5′ end of the GAPDH gene of less than 3.0 were selected for subsequent hybridization on HG-U95Av2 probe arrays (Affymetrix). Washing and staining of the probe arrays was performed as described (see the original Affymetrix literature (Lockhart and Lipshutz)).
The Affymetrix software (Microarray Suite, Version 4.0.1) extracted fluorescence intensities from each element on the arrays as detected by confocal laser scanning according to the manufacturer's recommendations.
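The cRNA integrity criterion described above (GAPDH 3′/5′ intensity ratio below 3.0 on the Test3 array) amounts to a simple filter. The following sketch is illustrative only; the function name and the handling of a non-positive 5′ signal are assumptions not taken from the protocol.

```python
def passes_integrity_check(intensity_3prime, intensity_5prime, max_ratio=3.0):
    """Return True if the GAPDH 3'/5' intensity ratio is below max_ratio.

    A non-positive 5' intensity is treated as a failed check (an assumption
    of this sketch, since the ratio is undefined in that case).
    """
    if intensity_5prime <= 0:
        return False
    return intensity_3prime / intensity_5prime < max_ratio
```

For example, a cocktail with a 3′ signal of 1200 and a 5′ signal of 600 (ratio 2.0) would be accepted, whereas a ratio of exactly 3.0 would not.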
Class separation by principal component analysis and hierarchical cluster analysis: In a first step we reduced the dimensionality of the number of genes. To this end, we scaled the data from each array to a target intensity value of 50 (Affymetrix Microarray Suite) in order to be able to perform inter-array comparisons. Then all data were analyzed using Significance Analysis of Microarrays (Multiclass Response, Stanford University) and we selected a distinct number of genes based on a permutation test. This reduced set of genes, which proved to be significant, was then analyzed using the publicly available Java application J-Express analysis tool (download at www.molmine.com). Principal Component Analysis and Hierarchical Cluster Analysis (parameters Cluster method: single linkage and Distance metric: euclidean) showed a clear separation of analyzed groups of samples, e.g. healthy bone marrow versus leukemia.
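The scaling step can be sketched as follows. The text states only that each array is scaled to a target intensity of 50; this sketch assumes that the scaling factor is derived from a trimmed mean with 2% of the values removed at each end, which is an assumption and not necessarily the exact procedure of the Microarray Suite. The function name is hypothetical.

```python
def scale_to_target(intensities, target=50.0, trim_fraction=0.02):
    """Multiply all intensities of one array so their trimmed mean equals `target`.

    The 2% trimming at each end is an assumption of this sketch.
    """
    ordered = sorted(intensities)
    k = int(len(ordered) * trim_fraction)        # values dropped at each end
    kept = ordered[k:len(ordered) - k]
    trimmed_mean = sum(kept) / len(kept)
    if trimmed_mean <= 0:
        raise ValueError("trimmed mean must be positive to scale the array")
    factor = target / trimmed_mean
    return [value * factor for value in intensities]
```

After scaling, every array has the same trimmed mean, so intensities of a given gene can be compared between arrays.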
A previously described method (Science 1999 Oct. 15; 286(5439):531-7) was modified to reduce the number of candidate genes that could distinguish between our leukemic samples of interest. In a first step the raw data were scaled using Affymetrix software (target intensity 50 for all genes). To avoid division by zero or negative numbers, which occur due to the current expression algorithm (Affymetrix), we set all average intensities of 20 or less to 20. Briefly, for a more detailed gene expression profiling we applied the data analysis method according to Golub et al. using weighted voting. In a first step gene expression levels were log-transformed with a cut-off value set at 20 units. To assess the significance of selected genes we performed a leave-one-out cross-validation. Only those genes which were contained in all cross-validation classifiers were considered important. To assess associations of genes arising by chance we performed a permutation test (100 cycles). Because the number of informative genes which are able to discriminate between samples is unknown, we applied the Golub method for different numbers of informative genes (range: 10-200). The minimal set of genes which provided optimal classification accuracy was selected to avoid overfitting.
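The weighted voting scheme with leave-one-out cross-validation can be sketched for two classes as follows. This is a simplified illustration: the base-10 logarithm, the ranking of genes by the absolute signal-to-noise weight and all function names are assumptions of this sketch; only the floor of 20 units is taken from the text.

```python
import math
from statistics import mean, stdev

FLOOR = 20.0  # intensities of 20 or less are set to 20, as stated in the text


def preprocess(values):
    # Floor and log-transform; the base-10 logarithm is an assumption.
    return [math.log10(max(v, FLOOR)) for v in values]


def train(class_a, class_b, n_genes):
    """Select the n_genes with the largest absolute signal-to-noise weight.

    class_a / class_b: lists of samples, each a list of raw gene intensities.
    Each class needs at least two samples for the standard deviation.
    """
    a = [preprocess(s) for s in class_a]
    b = [preprocess(s) for s in class_b]
    model = []
    for g in range(len(a[0])):
        xa = [s[g] for s in a]
        xb = [s[g] for s in b]
        spread = stdev(xa) + stdev(xb)
        if spread == 0:
            continue  # gene carries no usable variation
        weight = (mean(xa) - mean(xb)) / spread
        boundary = (mean(xa) + mean(xb)) / 2
        model.append((g, weight, boundary))
    model.sort(key=lambda item: abs(item[1]), reverse=True)
    return model[:n_genes]


def predict(model, sample):
    """Return (predicted class, prediction strength) by summing weighted votes."""
    x = preprocess(sample)
    votes_a = votes_b = 0.0
    for g, weight, boundary in model:
        vote = weight * (x[g] - boundary)
        if vote > 0:
            votes_a += vote
        else:
            votes_b -= vote
    total = votes_a + votes_b
    strength = abs(votes_a - votes_b) / total if total else 0.0
    return ("A" if votes_a >= votes_b else "B"), strength


def leave_one_out_accuracy(class_a, class_b, n_genes):
    """Hold out each sample once, retrain on the rest, and count correct calls."""
    labelled = [("A", s) for s in class_a] + [("B", s) for s in class_b]
    correct = 0
    for i, (label, sample) in enumerate(labelled):
        rest = labelled[:i] + labelled[i + 1:]
        model = train([s for l, s in rest if l == "A"],
                      [s for l, s in rest if l == "B"], n_genes)
        if predict(model, sample)[0] == label:
            correct += 1
    return correct / len(labelled)
```

Running `leave_one_out_accuracy` for several values of `n_genes` and keeping the smallest value that reaches the best accuracy mirrors the selection of the minimal gene set described above.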
Monitoring the gene expression level of thousands of mRNA transcripts simultaneously in one experiment is the key technology to find out the specific genes which allow the subsequent development of a class prediction model. We therefore used the Affymetrix oligonucleotide microarray technology (GeneChip® Instrument System) to obtain gene expression profiles of each individual clinical sample of interest. The HG-U95Av2 probe arrays gave us information about the relative mRNA abundance of about 12,000 full length human genes which are represented on these high-density oligonucleotide microarrays.
In total, 8 bone marrow samples of healthy volunteers and leukemia patients were investigated. Five different types of bioinformatic calculations were performed.
Three defined cytogenetic aberrations t(8;21)(q22;q22) (n=9), t(15;17)(q22;q12) (n=16) and M4eo with inv(16) (p13q22) (n=10) corresponding to the 4 FAB-subtypes AML M2, M3, or M3v and M4eo, respectively. After we obtained bone marrow aspirates from 35 untreated patients with newly diagnosed AML, all cases were characterized by cytomorphology, cytogenetics and by molecular genetics (
We obtained bone marrow (BM) aspirates from 37 AML patients standing for four morphological and three underlying cytogenetic subgroups that were sent to the Laboratory of Leukemia Diagnostics (LFL) for central diagnosis within the German AMLCG study (Klinikum Grosshadern, Munich, Germany). They were selected for this study on the basis of several criteria. It was mandatory that none of the patients had been treated. All samples, exclusively newly diagnosed in our laboratory, had to be well characterized as de novo AML and diagnosis had been proven by cytomorphology, cytogenetics, flow cytometry and molecular genetics in every single case. All samples for gene expression analysis were taken at the time of the initial diagnostic biopsy when remaining material was immediately lysed in RLT buffer (Qiagen), frozen and stored at −80° C. until preparation for gene expression analysis.
For microarray analysis the GeneChip System (Affymetrix, Santa Clara, Calif., USA) was used. The targets for GeneChip analysis were prepared according to the current Expression Analysis Technical Manual. Briefly, frozen lysates of the leukemia samples were thawed, homogenized (QIAshredder, Qiagen) and total RNA extracted (RNeasy Mini Kit, Qiagen). Normally 10 μg total RNA isolated from 1×10⁷ cells was used as starting material in the subsequent cDNA synthesis using Oligo-dT-T7-Promotor Primer (cDNA synthesis Kit, Roche Molecular Biochemicals). The cDNA was purified by phenol-chloroform extraction and precipitated with 100% ethanol overnight. For detection of the hybridized target nucleic acid, biotin-labeled ribonucleotides were incorporated during the in vitro transcription reaction (Enzo® BioArray™ HighYield™ RNA Transcript Labeling Kit, ENZO). After quantification of the purified cRNA (RNeasy Mini Kit, Qiagen), 15 μg were fragmented by alkaline treatment (200 mM Tris-acetate, pH 8.2, 500 mM potassium acetate, 150 mM magnesium acetate) and added to the hybridization cocktail sufficient for 5 hybridizations on standard GeneChip microarrays. Before expression profiling, Test3 Probe Arrays (Affymetrix) were chosen for monitoring of the integrity of the cRNA. Only labeled cRNA cocktails which showed a ratio of the measured intensity of the 3′ to the 5′ end of the GAPDH gene of less than 3 were selected for hybridization on HG-U95Av2 probe arrays (Affymetrix). Washing and staining of the probe arrays was performed as described. The Affymetrix software (Microarray Suite, Version 4.0.1) extracted fluorescence intensities from each element on the arrays as detected by confocal laser scanning according to the manufacturer's recommendations.
In a first step we reduced the dimensionality of the number of genes. Therefore we scaled the data from each array to a target intensity value 50 (Affymetrix Microarray Suite) in order to be able to perform inter-array comparisons. Then all data was analyzed using Significance Analysis of Microarrays (Multiclass Response, Stanford University) and we selected 580 genes based on a permutations test. This reduced set of genes which showed to be significant then was analyzed using the public available Java application J-Express analysis tool (download at www.molmine.com). Principal Component Analysis and Hierarchical Cluster Analysis (parameters Cluster method: single linkage and Distance metric: euclidean) showed a clear-separation-of-analyzed groups-of-samples e.g. healthy bone marrow versus leukemia.
This analysis was carried cut as described in Example 1 (C) above. Briefly, classification of tumor samples was achieved by using a set of samples whose class had been already determined. This set was called training set. By using the oligonucleotide microarrays (Lockhart, D. J., et al., Nat Biotechnol 14 (1996) 1675-80), the, transcript levels in training set samples were measured for those genes that were represented on the microarray. The values for “transcription strength” were determined by averaging the values of a set of probes which were compared to a set of nearly identical probes containing a single mismatch. This was performed by using; methods provided by the oligonucleotide array of Affymetrix Inc.
In order to obtain comparable values between different samples, they had to be standardized first. The method followed that described (Lockhart, D. J., et al., Nat Biotechnol 14 (1996) 1675-80), except that correcting for (additive) background had been omitted. In brief, the data from one of the samples were declared to serve as a “standard”, and the values from all other samples were adapted to this standard. For every possible comparison to this standard, a set of “reliable” values was determined by calculating the correlation coefficient for a series of intervals of increasing length. The lower bound of reliability was the bound of the interval that had a correlation coefficient less than or equal to the smaller intervals. From all reliable values, 2 (logarithmized) correction factor was calculated by computing the median of the differences of the logarithmic values. Values that were zero or negative prior to taking the logarithm were not taken into account.
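The correction-factor step can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the procedure actually used: the function names `log_correction_factor` and `standardize` are hypothetical, natural logarithms are assumed, and the selection of “reliable” values by interval correlation is deliberately left out, so every positive pair of values is treated as reliable.

```python
import math
from statistics import median


def log_correction_factor(standard, sample):
    """Median of the differences of the logarithmic values.

    `standard` and `sample` are equally long lists of expression values.
    Pairs in which either value is zero or negative are skipped, as the
    logarithm is undefined for them. The "reliable value" filtering from
    the text is NOT modelled here (assumption: all positive pairs count).
    """
    diffs = [
        math.log(s) - math.log(v)
        for s, v in zip(standard, sample)
        if s > 0 and v > 0
    ]
    if not diffs:
        raise ValueError("no positive value pairs to compare")
    return median(diffs)


def standardize(standard, sample):
    """Rescale `sample` by the back-transformed correction factor."""
    factor = math.exp(log_correction_factor(standard, sample))
    return [v * factor for v in sample]
```

For a sample that is measured at half the intensity of the standard, the sketch would return a factor of about 2 and rescale every value accordingly, including the values that were excluded from the estimate.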
The obtained data matrix contained values from one sample per column. The gene expression profile across all samples for one gene or gene fragment represented on the oligonucleotide microarray was contained in a row of the matrix. To allow for rapid calculation of the classifier and to reduce memory usage, certain genes were pre-selected from the set of all genes represented on the array. The following criteria were applied:
μi refers to the average of the i-th class (i=1, . . . ,k), μ to the total average, σi to the standard deviation of the i-th class and t to an arbitrary threshold ≤1. Selection by these methods typically resulted in a reduction in the number of genes by a factor of 10-30. To check the quality of the selection procedure, the first two principal components (Jolliffe, Principal Component Analysis (1986), Springer (New York)) for the samples were plotted. This made it possible to judge whether or not a rigorous discrimination between the different classes was possible.
For construction of the classifier, decision trees (Breiman et al., Classification and Regression Trees, Wadsworth & Brooks/Cole (Monterey)) were used. Simple decision trees that discriminate between n classes by using only transcription levels for (n−1) genes were used. They were trained, and the selected genes were then discarded from the original data set. A new tree was constructed by using the truncated data set, and the entire procedure was iterated until a predetermined number of trees was reached. The optimal number of trees could be estimated by counting the number of misclassifications of classifiers built from different numbers of trees. For this, an independent data set or cross-validation had to be used. The final vote of the multi-classifier was obtained by applying a vote-by-majority rule to the predictions of the contained trees. In the example of the present invention 15 decision trees had been used for the multi-classifier. This allowed correct classification of 100% of the samples, discriminating between classes that were given by chromosomal aberrations. To estimate generalization properties, i.e. how accurately the classifier may perform on samples that have not been used for training, cross-validation had been used (Efron and Tibshirani, An introduction to the bootstrap (1993), Chapman & Hall (New York, London), pp. 237-247).
From this point of view it was found that a set of 17 genes was sufficient to distinguish distinct AML subtypes from each other with high precision (Table 1).
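The iterative construction of the multi-classifier can be sketched as follows. This is a simplified Python illustration, assuming only two classes, so that each “tree” uses a single gene (n−1 = 1) and reduces to one threshold. The function names, the choice of midpoints as candidate thresholds and the use of training error as the split criterion are assumptions made for the sketch and are not taken from the text.

```python
from collections import Counter


def train_stump(rows, labels, genes):
    """Find the single-gene threshold with the fewest training errors.

    rows:   dict mapping gene name -> list of expression values per sample
    labels: list of class labels (exactly two distinct classes assumed)
    genes:  gene names that may still be used
    Returns (gene, threshold, class_below, class_above).
    """
    classes = sorted(set(labels))
    best = None
    for gene in genes:
        values = rows[gene]
        ordered = sorted(set(values))
        # candidate thresholds: midpoints between neighbouring observed values
        for lo, hi in zip(ordered, ordered[1:]):
            thr = (lo + hi) / 2
            for below, above in (classes, classes[::-1]):
                errors = sum(
                    (below if v < thr else above) != lab
                    for v, lab in zip(values, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, gene, thr, below, above)
    if best is None:
        raise ValueError("no gene with at least two distinct values")
    return best[1:]


def build_multi_classifier(rows, labels, n_trees):
    """Train n_trees stumps, discarding each selected gene before the next."""
    remaining = list(rows)
    stumps = []
    for _ in range(n_trees):
        stump = train_stump(rows, labels, remaining)
        stumps.append(stump)
        remaining.remove(stump[0])  # truncate the data set by the used gene
    return stumps


def predict(stumps, sample):
    """Vote-by-majority over the predictions of all stumps."""
    votes = Counter(
        below if sample[gene] < thr else above
        for gene, thr, below, above in stumps
    )
    return votes.most_common(1)[0][0]
```

Using an odd number of trees, such as the 15 mentioned above, avoids tied votes in the two-class case.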
The classification model was able to identify the 4 morphologically and 3 cytogenetically and molecular biological different subtypes AML with t(8;21), with t(15;17), and with inv(16) (
In conclusion, by comparing gene expression profiles of AML samples (three tested genetic subtypes: t(8;21), t(15;17) and inv(16)), genes could be identified which allowed a differentiation between the individual AML subtypes. In detail, it could be shown for the first time that these distinct abnormalities on the genomic level relate to a specific gene expression pattern. In other words, in the experimental setting the knowledge of the expression status of these designated genes was sufficient to predict the genetic abnormality and allows the diagnosis of specific genetically defined subtypes of AML (Table 1).
Results of methods described in I(E) are shown in Table 2 and
II) Pair-wise comparisons between normal bone marrow, AML, ALL, CML, and CLL: By pair-wise comparisons gene expression profiles of 8 cases of normal bone marrow, 48 AML, 9 ALL, 8 CML, and 7 CLL were evaluated. These led to the identification of subtype-specific genes (Tables 3-12).
To allow classification of AML subtypes according to the new WHO proposal we used the gene expression profiles of four genetically defined AML subtypes (t(8;21) n=9; t(15;17) n=18; inv(16) n=10; 11q23/MLL aberrations n=11). This led to the identification of subtype-specific genes (Table 13,
IV) Normal bone marrow versus distinct genetic subtypes of AML: We used the gene expression profiles of normal bone marrow (n=8) and of four genetically defined AML subtypes (t(8;21) n=9; t(15;17) n=18; inv(16) n=10; 11q23/MLL aberrations n=10). This led to the identification of genes that allow the distinction between normal bone marrow and each of the four AML subtypes (Table 14).
V) Identification of genes specifically separating normal bone marrow, AML, ALL, CML, and CLL: We used the gene expression profiles of normal bone marrow (n=8) and of AML (n=48), ALL (n=9), CML (n=8), and CLL (n=7). This led to the identification of xx genes that allow the distinction between normal bone marrow and each of the four leukemia subtypes (Table 15,
The expression profiles of 12,600 genes were analyzed in 103 patients suffering from chronic myeloid leukemia (CML), chronic lymphoid leukemia (CLL), acute lymphoblastic leukemia (ALL), and acute myeloid leukemia (AML). A set of 71 genes shown in Tables 16 to 19 was identified as the minimal set necessary to accurately diagnose prognostically relevant leukemia subtypes and to distinguish these from normal bone marrow (BM, n=8). Thus, microarray technology is a suitable method for diagnosis of leukemia.
Today, the classification of hematological malignancies according to the WHO criteria describes chronic myeloid leukemia (CML), chronic lymphoid (CLL), acute lymphoblastic (ALL), and acute myeloid leukemia (AML). Within the latter two several prognostically relevant subtypes are established (see example 4). This subclassification is based on genetic abnormalities of the leukemic blasts associated with different prognoses and becomes increasingly important to guide therapy. Thus, the development of new, specific treatment approaches requires the precise identification of these subtypes that may benefit from individual therapeutic protocols. It has already been shown that the development of drugs targeting molecular aberrations can dramatically improve outcome. The introduction of all-trans retinoic acid (ATRA) into the treatment of AML with t(15;17)(q22;q11-12) has improved outcome from about 50% to 80% long-term survivors (1). In CML patients imatinib, a designed molecule that inhibits the t(9;22)(q34;q11) specific chimeric tyrosine kinase BCR-ABL, induces dramatically higher response rates as compared to conventional drugs (2). To fully take advantage of specific treatment options a precise identification of distinct leukemia subtypes is mandatory. However, standard diagnostics of leukemia using a combination of complementary methods is expensive, time-consuming, and requires experienced specialists.
Since their introduction, microarrays have been promising tools for basic research. With regard to leukemia, the pivotal discrimination of unselected ALL and AML samples based on their gene expression signatures inspired numerous studies (3). Recently, subtypes of childhood ALL could be correlated to specific gene expression profiles, leading to both marker genes suitable for initial diagnostics and candidates as predictors for outcome (Yeoh et al., Cancer Cell, 2002). Additionally, novel entities in hematological malignancies could be identified based on their distinct expression pattern, as has been shown for multiple myeloma, large cell lymphoma, and childhood ALL (4-6). In example 4, it is demonstrated that cytogenetically defined AML subtypes can be correlated to specific gene expression profiles. AML FAB M2 with t(8;21)(q22;q22), FAB M3/M3v with t(15;17)(q22;q11-12), or M4eo with inv(16)(p13q22) could be classified based on a minimal set of 13 genes. These genes belong to a large variety of different functional classes including members of signaling pathways, cell surface antigens, as well as plasma glycoproteins. Several genes are known to be involved in cytoskeletal structure or transcriptional processes, or have not yet been functionally described further.
Here, gene expression profiles of 103 leukemia patients were acquired representing 11 groups and eight normal BM donors to designate leukemia-specific genes which build up the basis for a novel diagnostic tool. Following the aims of Golub, who introduced the cancer class prediction methodology (3, 7), all four major leukemia types were analyzed and also included cytogenetically defined subgroups of AML and ALL as described in the WHO classification of leukemia, respectively (
In a first step, based on 23 informative genes the samples were assigned to either normal BM, CLL, CML, ALL, or AML, respectively (Table 22; Description of Table 22: Classification scheme for 4 major leukemia types and normal BM. Matrices delineate distribution of actual leukemia types as compared with predicted types from pairwise comparisons. Class assignment can be based on the expression profiles of 23 genes. Except for pairwise comparison of AML versus ALL, all cases can be predicted accurately in leave-one-out cross validation with 100% accuracy and strong confidence. For each pairwise comparison the minimal set of informative genes is represented by approved HUGO Gene Nomenclature Committee (HGNC) symbols. Not yet approved genes are marked by asterisks.). In 9/10 pairwise comparisons all samples were classified correctly (335 individual assignments; 100% accuracy). In one comparison (AML versus ALL) 75/77 samples were classified correctly, resulting in an accuracy of 97%. Two ALL samples were misclassified as AML. This may be due to the heterogeneity of both groups (n=18 versus n=59) causing noise in the expression data.
For each pairwise comparison a set of discriminative genes is disclosed in Table 20, whereby the gene names can be found in Table 21. The most discriminative and informative genes are marked by asterisks in Table 20 and are the 71 genes shown in Tables 16 to 19.
In detail, we found phospholipid scramblase 1 (PLSCR1) to be expressed at lower levels in AML and ALL as compared to normal BM. PLSCR1 encodes a plasma membrane protein and has been proposed to play a role in transbilayer migration of phospholipids and in recognition and phagocytic clearance of injured, aged, or apoptotic cells (8). The biologic effects of interferon-alpha may be mediated by PLSCR1, which is markedly upregulated by interferon (9, 10). We also observed that LEF-1 was absent in myeloid leukemias but highly expressed in lymphoid leukemias. LEF-1 was shown to be mitogenic and important for cell survival in pro-B cells (11). The B-cell specific coactivator of octamer binding transcription factors, POU2AF1, plays an important role in the antigen-driven stages of B cell activation and maturation and discriminates between AML and CLL (12). MSF has been reported to be a translocation partner of the mixed-lineage leukemia gene (MLL) in AML and was able to separate AML from ALL (13). Likewise, OS-9, not yet further functionally described except for amplification in osteosarcomas, was differentially expressed between AML and ALL (14). HLA-DMB plays a critical role in antigen presentation by catalyzing the release of class II HLA-associated invariant chain binding sites for acquisition of antigenic peptides (15). It is known that lymphocytes in CLL express high levels of class II antigens whereas mature myeloid leukemias are e.g. HLA-DR negative (16, 17). Therefore, the differential expression of HLA-DMB in CML as compared to CLL illustrates well the differential expression of cell surface MHC class II molecules. NCOA1 plays a critical role in STAT3 and STAT6 pathways and was expressed higher in CLL as compared to ALL, suggesting an inhibitory effect of STAT6-mediated transactivation in CLL (18). CD27, a member of the tumor necrosis factor receptor family whose surface expression has already been reported in CLL (19), was identified as a gene that assigns samples to either ALL or CLL.
We also detected LCN2 that was shown to be a modulator of inflammation regulated by interleukin-9 with highest expression in CML samples (20). IRF4, an immune system-restricted interferon regulatory factor that is required for lymphocyte activation showed no expression in CML while it was expressed in normal BM. Recently, an increase of IRF4 levels in CML patients demonstrated an association with a good response to interferon-alpha therapy (21). Several other proteins (DEFA3, SGP28, CAMP, CLC) are known to be stored in the granules of neutrophils and allowed assignment of leukemic samples to the CML type if highly expressed (22-25).
The second step of our approach was to build up a classifier for the identification of AML subtypes genetically defined according to the WHO classification, i.e. AML with t(8;21), with t(15;17), with inv(16), and with 11q23-translocations involving the MLL gene, respectively. In addition, a category ‘other’ was analyzed comprising AML with normal karyotype (n=3), AML with complex aberrant karyotype (n=4), and AML with trisomy 8 as sole abnormality (n=3), respectively. A set of the 25 most informative genes was identified based on pairwise comparisons and one-versus-all (OVA) comparisons. None of these genes had already been identified for the classification of the four leukemia types and normal BM as described above. As shown in
The following genes were identified in OVA comparisons and discriminate distinct AML subtypes. The gene most valuable for prediction of AML M4eo with inv(16) was MYH11. Its higher expression as compared to all other AML most probably is due to hybridization of the M4eo-specific fusion transcripts CBFβ-MYH11 to corresponding MYH11-oligonucleotides represented on the microarray (26). Likewise, the higher expression of CBFA2T1 (formerly ETO) in AML with t(8;21) may be due to a similar effect of hybridization of the subtype-specific AML1-ETO fusion transcript (27). Another highly characteristic gene for t(8;21) positive AML was POU4F1, which has been described to play an important role in retinal ganglion cell differentiation and has recently been shown to confer an oncogenic potential when co-transfected with H-RAS (28). Furthermore, it was shown to be highly expressed in neuro-epithelioma and ewing sarcomas (29). Another member of this transcription factor family, POU2F2, was able to discriminate between t(11q23)/MLL versus group ‘other’. A related gene, POU2AF1, has recently been reported to be underexpressed in acute leukemia with t(11q23)/MLL-rearrangement (5). The most informative genes in our approach discriminating AML with t(11q23)/MLL-rearrangement from all other AML subtypes were SOCS-2 and MBNL. Generally, SOCS-2 shows a higher expression level in AML with t(11q23)/MLL-rearrangement and is known to play a role in cytokine-induced signaling pathways (30). Similarly, MBNL shows a higher expression in AML with t(11q23)/MLL-rearrangement as compared to all other AML samples. Its encoded protein as well as other MBL family members are localized in the nucleus and share a Cys3His zinc finger motif (31). MBL proteins occur in several isoforms due to alternative splicing (32) and may have different functions as has been shown for HOX genes (33). HOXA9 has been reported to be highly expressed in leukemia with MLL-rearrangements (5). 
In contrast, expression of HOXB5 is characteristic of AML group ‘other’ as compared to all other AML subtypes in our data. The most important genes discriminating AML with t(15;17) from all other AML subtypes were ARGHGAP4 and CTSW. ARGHGAP4 is predominantly expressed in hematopoietic cells but showed a lower expression level in AML with t(15;17) as compared to all other AML subtypes. It encodes a member of a family of signaling proteins involved in regulation of small GTP-binding proteins of the RAS superfamily, which themselves play an important role in cell cycle and apoptosis (34). CTSW encodes a recently described papain-like cysteine protease, which is predominantly expressed in NK cells and to a lesser extent in cytotoxic lymphocytes. It may represent a putative component of the endoplasmic reticulum-resident proteolytic machinery (35). An overview of the expression levels of genes in the AML subtypes can be found in
Subclassification of ALL comprising the three B-lineage groups ALL with t(9;22), with t(4;11), or with t(8;14) was analyzed next and compared with T-lineage ALL expression profiles. All samples were classified correctly on the basis of 19 genes (
In detail, the genes encoding for the T cell receptor beta subunit and T cell surface CD3 delta chain (TRB, CD3D) were identified as highly indicative of T-ALL as compared to both ALL with t(9;22) and all other ALL subtypes. This is in line with standard diagnostics of T-ALL by immunophenotyping where these antigens comprise the most specific ones (36). Similarly, MME (formerly CD10) was highly expressed in ALL with t(9;22) only. This on the one hand may reflect that t(9;22) is observed in common-ALL and in pre-B ALL only. On the other hand, this data again demonstrates that the gene used for diagnostic purposes in flow cytometry, MME, may be highly indicative of these ALL subtypes in comparisons to the more immature B-lineage ALL, i.e. pro-B ALL, as well as the mature B-ALL and the T-ALL. Furthermore, the identification of connective tissue growth factor (CTGF) as a specific marker for ALL with t(4;11) adds to previous data demonstrating its increased gene expression in childhood ALL in general (37). The glucocorticoid receptor beta has been shown to be highly expressed in ALL with t(4;11) but not in t(9;22) positive ALL. This is in line with the particularly poor prognosis of the latter subtype since response to corticoid therapy is one of the most powerful prognostic factors in ALL (38, 39). In addition, we speculate that new treatment options may be realized for T-ALL based on the high expression of adenosine deaminase (ADA) in this subtype. Inhibitors of ADA have been shown to be effective in indolent T-cell lymphomas but have not yet been evaluated in T-ALL (40). One cytokine differentially expressed between t(8;14) positive ALL and T-lineage ALL was SCYA3. We recommend testing the monitoring of its protein expression as a supplemental antigen useful for immunophenotypical identification of t(8;14) positive ALL. Finally, in ALL carrying t(4;11) v-myb is highly expressed and may thus be involved in the pathogenesis of this subtype. 
In general, a role of v-myb has been described for the transformation of myelomonocytic cells (41). A survey about the expression levels of genes in the AML subtypes can be found in
Finally, we intended to separate t(9;22) positive from t(9;22) negative ALL. Our data contained two genes, ADCY3 and the gene encoding the hypothetical protein KIAA1013, which were sufficient for 100% correct assignment of the 18 analyzed cases. Both genes showed a higher expression in t(9;22) positive as compared to t(9;22) negative ALL. Additionally, in distinguishing B-lineage from T-lineage ALL, CD3D and TRB repeatedly showed their usefulness as T-ALL marker genes as already described in
Generally, chromosomal aberrations are strongly associated with morphological characteristics. However, there are two chromosomal aberrations which are observed in both myeloid and lymphatic neoplasms, i.e. t(11q23)/MLL and t(9;22). The t(9;22) occurs in ALL and CML, and t(11q23)/MLL is observed in ALL and AML, respectively. Analyzing gene expression signatures of both t(9;22) positive ALL and CML, we identified two genes which allowed 17/17 correct lineage assignments. CD74 plays a critical role in MHC class II antigen processing and demonstrated a lower expression in t(9;22) positive CML (42). This may also explain the low MHC class II antigen presentation in CML in general and fits well with the lower HLA-DMB expression observed in CML as compared to CLL (Table 1). CAPN3 is a member of the papain superfamily and was expressed at higher levels in CML, discriminating it from t(9;22) positive ALL [see (Note: 38894_g_at)].
In addition, our results indicate that the expression signatures of two genes, CD24 and CTGF, are sufficient for 14/14 correct assignments of the t(11q23)/MLL positive leukemias either to ALL or to AML. Thus, in both scenarios lineage assignment can be accomplished based on specific gene expression signatures despite the same underlying chromosomal aberrations.
Taken together, these data demonstrate the utility of gene expression profiling using microarrays for diagnosis of leukemia. In total, 11 different leukemia entities could clearly be distinguished from each other and from normal BM, respectively. These leukemias are associated with highly differing prognoses and require specific treatment strategies. By performing these analyses on a single platform requiring basic molecular biological methods, this approach guarantees broad access to high-quality diagnosis of leukemia. The robust gene expression analysis with high diagnostic accuracy can substitute the combination of cytomorphology, cytogenetics, immunophenotyping, and molecular biological methods used today. Compared to the combination of methods used so far, this approach also reduces costs. In order to introduce diagnostical genomics into routine clinical practice, prospective trials in parallel to conventional methods are necessary to prove the reliability in a large cohort of patients. Furthermore, gene expression patterns will allow the additional subclassification of leukemia especially in subtypes with no specific cytogenetic markers and the identification of deregulated master genes within distinct leukemia entities can guide the way to new therapeutic approaches.
When comparing two groups of microarray experiments, Golub's method sorts the genes with respect to the signal-to-noise ratio of gene x: Sx=(μ1−μ2)/(σ1+σ2), where μk and σk denote the mean expression and standard deviation of gene x in group k. According to a specified number of “informative” genes (e.g. 20) the best discriminating genes are selected. For each informative gene a decision limit is calculated as bx=(μ1+μ2)/2. To classify a new sample of an independent test set, the gene expression levels of informative genes are taken and for each gene x and sample y a so-called vote is calculated as Vx=Sx(gxy−bx), where gxy denotes expression level of gene x in sample y. The votes of all informative genes are summed up (“weighted voting”) and depending upon the sign of this sum the new sample is classified as group 1 or group 2. The confidence in the prediction is calculated as |ΣVx/Σ|Vx||.
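The weighted-voting scheme described above can be written down as a short Python sketch. Several details are assumptions made for illustration only: the function names are hypothetical, the sample standard deviation is used for σ, informative genes are ranked by the absolute value of Sx, and a positive vote sum is mapped to group 1.

```python
from statistics import mean, stdev


def signal_to_noise(group1, group2):
    """Sx = (mu1 - mu2) / (sigma1 + sigma2) for one gene."""
    return (mean(group1) - mean(group2)) / (stdev(group1) + stdev(group2))


def train(expr, labels, n_informative):
    """Return the top genes as (gene, Sx, bx) tuples.

    expr:   dict mapping gene name -> list of expression values per sample
    labels: list with entries 1 or 2 giving the group of each sample
    """
    model = []
    for gene, values in expr.items():
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        g2 = [v for v, lab in zip(values, labels) if lab == 2]
        s = signal_to_noise(g1, g2)
        b = (mean(g1) + mean(g2)) / 2  # decision limit bx
        model.append((gene, s, b))
    # assumption: rank by |Sx| so that genes high in either group qualify
    model.sort(key=lambda entry: abs(entry[1]), reverse=True)
    return model[:n_informative]


def classify(model, sample):
    """Weighted voting: returns (predicted group, confidence)."""
    votes = [s * (sample[gene] - b) for gene, s, b in model]
    total = sum(votes)
    weight = sum(abs(v) for v in votes)
    confidence = abs(total) / weight if weight else 0.0
    return (1 if total > 0 else 2), confidence
```

A confidence of 1 means that all informative genes voted for the same group; values near 0 indicate that the votes nearly cancel each other.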
However, the decision limit proposed by Golub does not provide optimal classification accuracy in all situations. Importantly, when the standard deviations of expression levels within the two groups are very different, the decision limit is biased towards the group with the higher standard deviation. A decision limit for a particular gene can be considered optimal if it achieves maximum classification accuracy for a given dataset. By systematically determining classification accuracies for a set of possible decision limits, an optimal decision limit can be calculated. We selected an optimal decision limit from the following set of decision limits Lx: Lx={(gxy+gx(y−1))/2 | 1<y≤n}, where gxy denotes the expression level of gene x in sample y and n denotes the total number of samples in the training set.
Additionally, we applied a heuristic approach to select a minimal set of discriminative genes which provides maximum classification accuracy in leave-one-out crossvalidation. For a given set of 20 informative genes, weighted voting was applied as described above and the classification accuracy was calculated by crossvalidation. Our algorithm therefore consists of the following steps: (i) Calculate the top 20 discriminating genes according to the signal-to-noise ratio. (ii) Calculate classification accuracy and confidence based on optimal decision limits for each of the top 20 genes. (iii) Select the gene which provides the best classification accuracy and confidence in step (ii). (iv) Test for each of the remaining 19 genes whether adding this gene to the model improves accuracy and confidence; if the gene improves accuracy and confidence, it is added to the weighted voting model, otherwise it is discarded.
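Steps (ii) to (iv) amount to a greedy forward selection, which can be sketched in Python as follows. The sketch assumes that the crossvalidation itself is supplied by the caller as a function `evaluate` returning an (accuracy, confidence) pair for a list of genes, and that “improves accuracy and confidence” means a strictly better pair when accuracy is compared first and confidence second. Both the name `greedy_select` and this comparison rule are assumptions, since the text does not specify them.

```python
def greedy_select(candidates, evaluate):
    """Greedy forward selection of a small discriminative gene set.

    candidates: gene names ranked by signal-to-noise ratio (step i)
    evaluate:   callable taking a list of genes and returning
                (accuracy, confidence) from leave-one-out crossvalidation
    Returns (selected genes, (accuracy, confidence) of the final model).
    """
    # steps (ii) and (iii): score every gene alone and keep the best one
    best_gene = max(candidates, key=lambda gene: evaluate([gene]))
    selected = [best_gene]
    best_score = evaluate(selected)

    # step (iv): try each remaining gene once, keep it only if it helps
    for gene in candidates:
        if gene == best_gene:
            continue
        score = evaluate(selected + [gene])
        if score > best_score:  # tuples compare accuracy first
            selected.append(gene)
            best_score = score
    return selected, best_score
```

Because every candidate is tested only once, the procedure needs at most twice as many crossvalidation runs as there are candidate genes, at the price of not guaranteeing a globally optimal subset.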
In detail, this method can be described as follows:
Differentially expressed genes can potentially be used in medical diagnostics, if the gene expression patterns are reliable and specific for a particular disease. diffgenes is a program to identify differentially expressed genes in microarray experiments. Its algorithm is based on the method proposed by Golub, but contains two improvements: an optimized decision limit per gene and a minimal set of discriminative genes.
The new method was applied to a human dataset from the domain of cancer research consisting of 103 microarrays with 12625 genes each. diffgenes outperforms Golub's method clearly both in terms of accuracy and confidence of classifications. The biological validation of the results is facilitated, because diffgenes identifies a very small number of candidate genes (typically <5). Microarray datasets can be analyzed with diffgenes on the Internet at http://martin-dugas.de/diffgenes/
Microarrays are used in ongoing research to characterize disease processes on a molecular level. Gene expression analysis makes it possible to identify new subtypes within known diseases with prognostic relevance for the patients [Alizadeh 2000].
For interpretation of the wealth of data—more than 10.000 parameters per experiment—it is advisable to integrate microarray data with detailed clinical information. For applications in medical diagnostics, significant associations between gene expression profiles and sample groups resulting in classification accuracies in the range of 70-80% are not sufficient; for diagnostic purposes at least 95% classification accuracy is required.
If a certain disease is characterized by a specific gene product, e.g. a pathologic fusion gene, a precise measurement of the expression of this particular gene should be a reliable marker for the disease. Therefore in a diagnostic setting, very few and specific genes would be desirable.
However, for many diseases the precise molecular pathogenesis is not yet known. In addition, the function of many genes on currently available microarrays like Affymetrix GeneChip® is still unclear.
Therefore microarray data should be analyzed and interpreted carefully. By integration of data from different diagnostic modalities (morphology, PCR, FISH, clinical data) the biological plausibility and consistency of microarray data can be verified.
When comparing two groups of microarray experiments, Golub's method sorts the genes with respect to the signal-to-noise ratio of gene x: Sx=(μ1−μ2)/(σ1+σ2), where μk and σk denote the mean expression and standard deviation of gene x in group k.
According to a specified number of “informative” genes (e.g. 20) the best discriminating genes are selected. For each informative gene a decision limit is calculated as bx=(μ1+μ2)/2. To classify a new sample of an independent test set, the gene expression levels of informative genes are taken and for each gene x and sample y a so-called vote is calculated as Vx=Sx(gxy−bx), where gxy denotes the expression level of gene x in sample y. The votes of all informative genes are summed up (“weighted voting”) and depending upon the sign of this sum the new sample is classified as group 1 or group 2. The confidence in the prediction is calculated as |Σ Vx/Σ|Vx||. To assess the significance of each gene, a permutation test is performed, which determines signal-to-noise ratios when class labels are permuted randomly. To assess the robustness of the classifier, a leave-one-out crossvalidation is performed. Accuracy is the rate of correctly classified test samples. Further details are contained in [Golub 1999], [Pomeroy 2002, Supplement].
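The permutation test for a single gene can be sketched in Python as shown here. The number of permutations, the fixed random seed, the use of the absolute signal-to-noise ratio and the (hits+1)/(permutations+1) estimate are assumptions chosen for the illustration, and the function names are hypothetical.

```python
import random
from statistics import mean, stdev


def signal_to_noise(group1, group2):
    """Sx = (mu1 - mu2) / (sigma1 + sigma2) for one gene."""
    return (mean(group1) - mean(group2)) / (stdev(group1) + stdev(group2))


def permutation_p_value(values, labels, n_perm=1000, seed=0):
    """Estimate how often random class labels reach the observed |Sx|.

    values: expression levels of one gene, one entry per sample
    labels: list with entries 1 or 2 (each group needs >= 2 samples,
            and the values must not be constant within both groups)
    """
    rng = random.Random(seed)

    def abs_snr(current_labels):
        g1 = [v for v, lab in zip(values, current_labels) if lab == 1]
        g2 = [v for v, lab in zip(values, current_labels) if lab == 2]
        return abs(signal_to_noise(g1, g2))

    observed = abs_snr(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute class labels, keep group sizes
        if abs_snr(shuffled) >= observed:
            hits += 1
    # add-one estimate so that the p-value is never exactly zero
    return (hits + 1) / (n_perm + 1)
```

A gene whose two groups are cleanly separated should obtain a small p-value, whereas a gene with interleaved values should not.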
The decision limit proposed by Golub does not provide optimal classification accuracy in all situations. As can be seen in
A decision limit for a particular gene can be considered optimal, if it achieves maximum classification accuracy for a given dataset. By determining systematically classification accuracies for a set of possible decision limits, an optimal decision limit can be calculated. The diffgenes program selects an optimal decision limit from the following set of decision limits Lx:
Lx={(gxy+gx(y−1))/2 | 1<y≤n}
where gxy denotes expression level of gene x in sample y, n denotes the total number of samples in the training set.
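The search over Lx can be illustrated with a minimal Python sketch. It assumes that the samples are first sorted by the expression level of gene x, so that each candidate limit is the midpoint between two neighbouring values; the text does not state the ordering explicitly. The function names are hypothetical, and accuracy is computed on the training set only.

```python
def candidate_limits(values):
    """Midpoints between neighbouring expression values (the set Lx)."""
    ordered = sorted(values)
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]


def optimal_limit(values, labels):
    """Return (limit, accuracy) with the highest training accuracy.

    values: expression levels of gene x, one entry per sample
    labels: list with entries 1 or 2 giving the group of each sample
    """
    n = len(values)
    best_limit, best_accuracy = None, -1.0
    for limit in candidate_limits(values):
        # samples correctly classified if group 1 lies above the limit
        correct = sum(
            (v > limit) == (lab == 1) for v, lab in zip(values, labels)
        )
        # the gene may equally be higher in group 2, so take the better side
        accuracy = max(correct, n - correct) / n
        if accuracy > best_accuracy:
            best_limit, best_accuracy = limit, accuracy
    return best_limit, best_accuracy
```

For group 1 values of 6, 10 and 30 and group 2 values of 1, 2 and 3, the midpoint of the group means is about 8.67 and misclassifies the sample at 6, whereas the search finds the limit 4.5, which separates all six samples.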
Golub's method selects an arbitrary number of “informative” genes to discriminate between two classes of samples according to their signal-to-noise ratio, typically in the range of 10 to 50 genes. Choosing too many genes carries the risk of overfitting, which causes the model to generalize poorly. Therefore diffgenes applies a heuristic approach to select a minimal set of discriminative genes which provides maximum classification accuracy in leave-one-out crossvalidation, i.e. for a given set of genes, weighted voting as described by Golub is applied and the classification accuracy is calculated by crossvalidation.
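The diffgenes algorithm consists of the following steps: (i) Calculate the top 20 discriminating genes according to the signal-to-noise ratio. (ii) Calculate classification accuracy and confidence based on optimal decision limits for each of the top 20 genes. (iii) Select the gene which provides the best classification accuracy and confidence in step (ii). (iv) Test for each of the remaining 19 genes whether adding this gene to the model improves accuracy and confidence; if it does, the gene is added to the weighted voting model, otherwise it is discarded.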
The method was applied to a new human dataset from the domain of cancer research consisting of 103 Affymetrix GeneChip® microarrays with 12625 genes each. Table 23 presents an analysis of 18 samples class A versus 85 samples class non-A (Description of Table 23: Analysis of 18 samples class A versus 85 samples class non-A. On the left the analysis according to Golub is presented for 20 informative genes. The crossvalidation accuracy is 0.87, confidence 0.77. Samples where crossvalidation failed are listed. For each gene, signal-to-noise ratio, p-value (significance obtained from permutation test) and decision limit are provided. On the right the same data set is analyzed using diffgenes. By selection of 3 genes (marked with asterisks) out of the top 20 genes and selecting optimized decision limits, the crossvalidation accuracy reaches 0.96, confidence 0.88.). Based on 20 informative genes, Golub's method results in a crossvalidation accuracy of 0.87 (confidence 0.77); with three genes out of the top 20 set, diffgenes achieves a crossvalidation accuracy of 0.96 (confidence 0.88). The same analysis was performed for one versus all (OVA) and all pairs (AP) comparisons in this dataset consisting of 5 different classes.
There are two major challenges in the analysis of microarray data: the number of variables (genes) is much higher than the number of individual samples and the correlation structure of the parameters is widely unknown.
Golub's method to analyse microarray data has been applied to important medical datasets [Armstrong 2002]. Recently many different approaches have been applied to microarray data: Classical statistical techniques like ANOVA with adjustment for multiple testing, significance analysis of microarrays (SAM) [Tusher 2001], selection of discriminative genes with support vector machines (SVM), neural networks and many more. This indicates that the underlying problem is important and non-trivial; a comparison of different methods is needed. Robustness of the generated mathematical models is an important issue, therefore bootstrap procedures and permutation tests are applied.
For medical diagnostics, differentially expressed genes are of interest, but the sensitivity and specificity for particular diseases must be validated prospectively in larger patient cohorts. The diffgenes method is an extension of Golub's method to improve classification accuracy, which is very relevant in a diagnostic setting. The optimized decision limit plays an important role, because the situation presented in
Emphasis must be placed on verification of results by other diagnostic procedures, because the selected “important” genes depend not only on the statistical procedure but also on the preprocessing of the data. In our setting, the integration of microarray analysis with other laboratory modalities (morphology, cytogenetics, molecular genetics, immunophenotyping) and clinical data made it possible to evaluate the plausibility and consistency of results. We are therefore optimistic that the demanding requirements for medical diagnostics can be fulfilled with microarray technology in the near future.
To assess the significance of each gene, a permutation test is performed, which determines signal-to-noise ratios when class labels are permuted randomly. To assess the robustness of the classifier, a leave-one-out crossvalidation is performed. Accuracy is the rate of correctly classified test samples.
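The permutation test described above can be sketched as follows. This is a minimal illustration, not the implementation used for the reported analyses: the function names, the number of permutations, the two-sided comparison and the add-one correction of the p-value are assumptions made for this example.

```python
import random
from statistics import mean, stdev

def signal_to_noise(values, labels):
    """Signal-to-noise ratio (m1 - m2) / (s1 + s2) of one gene; labels are 0 or 1."""
    a = [v for v, l in zip(values, labels) if l == 0]
    b = [v for v, l in zip(values, labels) if l == 1]
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

def permutation_p_value(values, labels, n_perm=1000, seed=0):
    """Fraction of random label permutations whose absolute ratio
    is at least as large as the observed one (assumed two-sided test)."""
    rng = random.Random(seed)
    observed = abs(signal_to_noise(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # class sizes are preserved, only the assignment changes
        if abs(signal_to_noise(values, shuffled)) >= observed:
            hits += 1
    # Add-one correction (an assumption) so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

A gene whose expression separates the two classes well yields a small p-value, because few random relabelings reproduce a ratio of that size.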
The second top-ranked gene was represented by the Affymetrix probe set identifier: 38894_g_a. However, no clear gene assignment was possible for this informative probe set. Therefore, CAPN3 was chosen.
Acute myeloid leukemia (AML) is a heterogeneous group of genetically defined diseases. Their classification is important with regard to prognosis and treatment. We performed microarray analyses for gene expression profiling on bone marrow samples of 37 patients with newly diagnosed AML. All cases had one of the distinct subtypes AML M2 with t(8;21), AML M3 or M3v with t(15;17), or AML M4eo with inv(16). Diagnosis was established by cytomorphology, cytogenetics, fluorescence in situ hybridization, and RT-PCR in every sample. By using two different strategies for microarray data analyses, this study for the first time revealed a unique correlation between AML-specific cytogenetic aberrations and gene expression profiles.
Acute myeloid leukemia (AML) is a heterogeneous group of diseases with respect to biology and clinical course. Since the introduction of the FAB-classification in 1976 diagnosis and classification have been based on cytomorphology and cytochemistry (1). As other techniques like immunophenotyping, cytogenetics, and molecular genetics contributed to the definition of AML subtypes the FAB-classification was updated. In 1999 the WHO classification for tumors of hematopoietic and lymphoid tissues was proposed. In an attempt to define biologically homogeneous entities which have clinical relevance morphologic, immunophenotypic, genetic and clinical features were incorporated (2, 3).
For optimal treatment approaches both a precise diagnosis and prognostic parameters that determine response to therapy and survival are needed. So far, the karyotype of the AML blasts is the most important independent prognostic factor. A favorable outcome under currently used treatment regimens with cure rates from 50% up to 85% was observed in several studies in patients with a) t(8;21)(q22;q22) occurring mostly in FAB subtype AML M2, b) inv(16)(p13q22) associated with AML M4eo and c) t(15;17)(q22;q11-12) associated with AML M3 and AML M3v (4-6). In contrast, chromosome aberrations with an unfavorable clinical course are −5/del(5q), −7/del(7q), inv(3)/t(3;3) and complex aberrant karyotypes with cure rates of less than 10% (7, 8). The remaining AML patients are assigned to a prognostically intermediate group. This latter group is very heterogeneous because it includes patients with a normal karyotype as well as those with rare chromosome aberrations of as yet unknown prognostic impact.
Besides their prognostic impact genetic aberrations are involved in the pathogenesis of leukemia. While for unbalanced cytogenetic aberrations the heterogeneous pathogenetic mechanisms have not yet conclusively been determined, several studies provide strong evidence for the central pathogenetic role of leukemia-specific fusion genes that are generated by the above mentioned balanced abnormalities (9-12). Therefore it can be postulated that AML with balanced abnormalities most probably display a homogeneous gene expression profile and thus are promising candidates for microarray analyses.
In a pivotal study, gene expression profiles were analyzed in bone marrow samples of 27 ALL and 11 AML. A set of 50 genes out of 6,817 analyzed genes was sufficient to discriminate ALL and AML. By leave-one-out cross-validation it was possible to correctly classify 36 out of 38 acute leukemia cases. A class predictor could automatically determine new leukemia cases out of an independent test set as belonging to the myeloid or the lymphoid lineage. Thus, these results demonstrated the possibility of cancer classification based on gene expression profiling (13). In a further approach comparing AML with trisomy 8 and AML with normal karyotype expression profiling revealed fundamental biological differences in AML with isolated trisomy 8 and normal cytogenetics (14). More recently, acute lymphoblastic leukemias (ALL) with translocations involving the MLL gene could be separated from ALL cases without MLL translocations and from cases with AML by gene expression profiling (15).
The aim of our investigation was to answer the question whether a leukemia specific genotype is associated with a distinct gene expression profile. Therefore, we analyzed three distinct genetic subtypes of acute myeloid leukemia: t(8;21)(q22;q22), inv(16)(p13q22) and t(15;17)(q22;q12) which lead to subtype specific fusion genes AML1-ETO, CBFB-MYH11 and PML-RARA, respectively. They are specifically associated with four distinct morphological subtypes according to the FAB classification: AML M2, AML M4eo, AML M3 and AML M3v(16-18). We performed microarray analyses on a cohort of leukemia samples (n=37) and applied several methodologies to evaluate genes which allowed an assignment to the corresponding type of cytogenetic aberration for classification.
This is the first time that AML-specific cytogenetic aberrations can be correlated with corresponding gene expression profiles and vice versa.
For this investigation we selected bone marrow (BM) samples from 37 AML patients representing four morphological and three underlying cytogenetic subgroups. All cases were sent for reference diagnostics to our laboratory and registered in our leukemia database (19). Samples were received either locally or by overnight mail. All samples were newly diagnosed de novo AML and were characterized by cytomorphology, cytogenetics, FISH, and molecular genetics in each case. Gene expression analyses were performed on cells remaining from the diagnostic samples. Samples had been lysed immediately, frozen and were stored at −80° C. from one to 34 months until preparation for gene expression analysis.
Analysis was based on May-Grünwald-Giemsa stain, myeloperoxidase reaction, and non-specific esterase reaction using alpha-naphthyl-acetate. All staining from bone marrow and blood was performed routinely according to standard procedures (20). The cytomorphologic diagnosis followed the criteria of the FAB classification and the new WHO classification (1, 3, 18).
Chromosome analyses were performed on bone marrow or peripheral blood samples according to standard protocols (21). Metaphases were analyzed for G-bands using a modified GAG-banding technique as described elsewhere (22). Twenty to 25 metaphase cells were analyzed. The chromosomes were interpreted according to the International System for Human Cytogenetic Nomenclature (23).
FISH was performed on interphase nuclei on bone marrow smears or on slides prepared for cytogenetic analysis. For interphase-FISH at least 100 interphase nuclei were evaluated. FISH was carried out using commercially available AML1-ETO, PML-RARA and CBFB probes (VYSIS, Downers Grove, IL, USA). The signals were evaluated with an Axioskop® (Zeiss, Jena, Germany). For documentation the analyzing system ISIS® (MetaSystems, Altlussheim, Germany) was used.
Mononuclear cells were isolated by a Ficoll gradient separation. 1×10⁷ cells were lysed in RLT-buffer (Qiagen, Hilden, Germany) and total RNA was extracted with an RNeasy kit (Qiagen) according to the manufacturer's instructions. RNA was eluted in 50 μl of elution buffer.
Five μl of the total RNA, an equivalent quantity of 1×10⁶ cells or about 1 μg of RNA, were reverse transcribed in a 40 μl reaction using 300 U of Superscript® (Life Technologies, Karlsruhe, Germany) and random hexamers (Pharmacia, Freiburg, Germany).
PCR for the specific AML1-ETO, CBFB-MYH11, or PML-RARA fusion transcripts was performed as described (24). For each sample an ABL-specific RT-PCR was performed to control the integrity of RNA using primers ABL5′: 5′-GGCCAGTAGCATCTGACTTTG-3′ and ABL3′: 5′-ATGGTACCAGGAGTGTTTCTCC-3′. Strict precautions were taken to prevent contamination. Water instead of cDNA was included as a blank sample in each experiment. Amplification products were analyzed on 1.5% agarose gels stained with ethidium bromide.
For microarray analysis the GeneChip® System (Affymetrix, Santa Clara, Calif.) was used. The targets for GeneChip® analysis were prepared according to the current Expression Analysis Technical Manual. Briefly, lysates of the leukemia samples were homogenized (QIAshredder, Qiagen, Hilden, Germany) and total RNA extracted (RNeasy Mini Kit, Qiagen). Normally, 10 μg total RNA isolated from 1×10⁷ cells was used as starting material in the subsequent cDNA synthesis using oligo[(dT)24T7promotor]65 primer (cDNA Synthesis System, Roche Diagnostics, Mannheim, Germany). The cDNA was purified by phenol:chloroform:IAA extraction (Ambion, Austin, Tex.) and acetate/ethanol precipitated overnight. For detection of the hybridized target nucleic acid, biotin-labeled ribonucleotides were incorporated during the in vitro transcription (Enzo® BioArray™ HighYield™ RNA Transcript Labeling Kit, ENZO, Farmingdale, USA). After quantification of the purified cRNA (RNeasy Mini Kit, Qiagen), 15 μg was fragmented by alkaline treatment (200 mM Tris-acetate, pH 8.2, 500 mM potassium acetate, 150 mM magnesium acetate) and added to the hybridization cocktail sufficient for 5 hybridizations on standard GeneChip® microarrays. Before hybridization onto U95Av2, Test3 microarrays (Affymetrix) were chosen for monitoring of the integrity of the cRNA. Washing and staining of the probe arrays were performed according to the current protocols (Micro_1v1, EukGE-WS2v2). The Affymetrix software (Microarray Suite, Version 4.0.1) extracted fluorescence intensities from each element on the microarrays as detected by confocal laser scanning according to the manufacturer's recommendations. Thirty-two out of 37 hybridization cocktails demonstrated high quality cRNA characteristics (Test3 probe arrays: 3′/5′ ratio of GAPDH probe sets ≦3.0) and were selected for building up class prediction models.
Potential clusters corresponding to the genetic subgroups were visualized applying a two-step approach. The data from each array were scaled to a target intensity value of 50 (Affymetrix Microarray Suite 4.0.1) in order to be able to perform inter-array comparisons. All data were permuted for 100 cycles using the multiclass response parameter of the Significance Analysis of Microarrays algorithm (SAM) (25) (http://www-stat.stanford.edu/~tibs/SAM/index.html). The total set of 12,600 genes was reduced to the significantly differentially expressed genes. In a second step, the reduced set of genes was prepared for principal component analysis (PCA) and analyzed with J-Express (26) (http://www.molmine.com/). For visualization in a two-dimensional plot we chose the first two principal components as they captured most of the variation in the original data set.
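The scaling step at the start of this paragraph can be illustrated with a short sketch. It assumes that scaling means multiplying every intensity of an array by one factor so that a trimmed mean equals the target value of 50. The trimmed mean and the 2% trim fraction are assumptions made for this example and are not taken from the text; the function name is hypothetical.

```python
def scale_to_target(intensities, target=50.0, trim=0.02):
    """Scale one array so that its trimmed mean intensity equals `target`.

    `trim` is the fraction dropped at each end before averaging
    (an assumed value, chosen only for illustration).
    """
    ordered = sorted(intensities)
    k = int(len(ordered) * trim)
    core = ordered[k:len(ordered) - k]
    factor = target / (sum(core) / len(core))
    return [x * factor for x in intensities]
```

After scaling, intensities from different arrays share a common overall level, which is the precondition for the inter-array comparisons mentioned above.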
We adapted a previously described method to reduce the number of candidate genes that could distinguish between the three different cytogenetic AML subgroups (13). Briefly, to avoid division by zero or negative numbers, which can result from the expression algorithm (Affymetrix Microarray Suite 4.0.1), we set all average fluorescence intensities of 1 or less to 1. Then, gene expression levels were log-transformed. Performing pairwise comparisons (A vs. B), for each gene g P(g,c) values and votes (defined by: P(g,c)=(m1(g)−m2(g))/(s1(g)+s2(g))) were calculated based on mean expression levels (m) and standard deviations (s) in the respective cytogenetic subgroup. Subsequently, votes were summed, and prediction strength (PS) values reflected the margin of victory in the direction of either cytogenetic group A or B of the pairwise comparison. PS values range between 0 and 1; values >0.45 demonstrate significance (according to the permutation test). The relevance of selected genes was assessed by performing leave-one-out cross-validation. Only those genes that were contained in all cross-validation classifiers were considered important. To test for random associations between genes and classes we performed a permutation test (100 cycles). Because the number of informative genes required to discriminate between samples is unknown, we applied this method for different numbers of informative genes (range: 2 to 200). The minimal set of genes which provided optimal classification accuracy together with the highest prediction strength was selected to avoid overfitting. To visualize the identified genes and check their suitability for class separation, a hierarchical cluster analysis was performed utilizing J-Express (26) (cluster method: average linkage; distance metric: Euclidean). The accuracy of this class prediction model was validated on an independent test set of five cases of AML not fulfilling the cRNA high quality criterion as outlined above.
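The weighted-voting calculation outlined above can be sketched for a single pairwise comparison as follows. This is an illustration under stated assumptions, not the code used for the analyses: the natural logarithm, the decision boundary at the midpoint of the two class means, and all function and variable names are assumptions, since the text gives only the definition of P(g,c) and describes PS as a margin of victory between 0 and 1.

```python
import math
from statistics import mean, stdev

def preprocess(intensity):
    """Set intensities of 1 or less to 1, then log-transform (log base is assumed)."""
    return math.log(max(intensity, 1.0))

def gene_weight(a_vals, b_vals):
    """Return P(g,c) = (m1 - m2) / (s1 + s2) and the assumed midpoint boundary."""
    m1, m2 = mean(a_vals), mean(b_vals)
    return (m1 - m2) / (stdev(a_vals) + stdev(b_vals)), (m1 + m2) / 2

def predict(sample, train_a, train_b):
    """sample: {gene: value}; train_a, train_b: {gene: [values]}.

    Returns the winning class ("A" or "B") and the prediction strength,
    computed here as |V_A - V_B| / (V_A + V_B).
    """
    v_a = v_b = 0.0
    for gene, x in sample.items():
        p, boundary = gene_weight(train_a[gene], train_b[gene])
        vote = p * (x - boundary)  # positive votes favor class A
        if vote > 0:
            v_a += vote
        else:
            v_b -= vote
    total = v_a + v_b
    ps = abs(v_a - v_b) / total if total else 0.0
    return ("A" if v_a >= v_b else "B"), ps
```

When every selected gene votes for the same class, the prediction strength is 1; votes that are split evenly give a value near 0.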
As basic units in this classifier, classification trees are used (27-29). The optimal number of trees has been determined to be 15 (data not shown). Class votes of these trees are aggregated by a vote-by-majority rule. The classifier was fed with gene expression intensity values from a set of 973 genes that had been chosen based on their r statistic:
where μi refers to the class averages, μ to the overall average, σi to the within-class standard deviation, and summation is carried out over all k classes. The threshold was set to r>0.75. Classification trees were constructed as follows: tree building was performed while restricting trees to contain no more than n−1 nodes to discriminate between n classes. The C5.0 algorithm was used (28). The variables (gene expression intensities) used for tree construction were eliminated from the data set, and a new tree was calculated based on the truncated data set. This procedure was iterated until the predetermined number of trees had been reached. The accuracy of the multiple-tree classifier was estimated by 10-fold cross validation (30) and on an independent test set of data from 5 bone marrow aspirates, where the quality of the corresponding cRNA preparation was slightly lower than the high quality standards required for the training set.
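The iterative construction described above (build a tree, remove the variables it used, repeat) can be sketched as follows. The tree-fitting step is passed in as a callable because the C5.0 algorithm itself is not reproduced here; `fit_tree` and its return shape are hypothetical and serve only this illustration.

```python
def build_multiple_trees(genes, fit_tree, n_trees=15):
    """Build up to `n_trees` trees on an iteratively reduced gene set.

    `fit_tree(candidate_genes)` is a hypothetical callable returning
    (tree, genes_used). Genes used by one tree are removed before
    the next tree is fitted.
    """
    remaining = list(genes)
    forest = []
    while len(forest) < n_trees and remaining:
        tree, used = fit_tree(remaining)
        if not used:
            break  # no further split found; stop early
        forest.append(tree)
        used = set(used)
        remaining = [g for g in remaining if g not in used]
    return forest
```

With trees restricted to n−1 nodes for n classes, each tree in a three-class problem draws on at most two variables, so 15 trees can use at most 30 variables in total.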
We investigated 37 AML cases representing three defined cytogenetic aberrations corresponding to four FAB subtypes: t(8;21)(q22;q22)/AML M2 (n=9), t(15;17)(q22;q12)/AML M3 or AML M3v (n=10, n=8), and inv(16)(p13q22)/AML M4eo (n=10). All cases were characterized by cytomorphology, cytogenetics, FISH, and RT-PCR (
The gene expression profiles of 37 AML samples were evaluated. Thirty-two hybridization cocktails demonstrated high quality cRNA characteristics (Test3 probe arrays: 3′/5′ ratio of GAPDH probe sets <3.0) and were selected for building class prediction models: t(8;21)/AML M2 (n=7), t(15;17)/AML M3 or M3v (n=9, n=7), and inv(16)/AML M4eo (n=9). Five cases were primarily excluded (3′/5′ ratios ranging between 3.9 and 5.4, see Methods) and were used for subsequent validations of the class prediction models: t(8;21)/AML M2 (n=2), t(15;17)/AML M3 or M3v (n=1, n=1), and inv(16)/AML M4eo (n=1).
In order to visualize clusters corresponding to the three underlying genetic subgroups we applied a two-step approach. Based on a permutation test (100 permutations) we correlated our expression data to the three different cytogenetic parameters (25). We obtained 1000 significant genes. By principal component analysis we were able to clearly separate the three distinct chromosomal aberrations t(8;21), t(15;17), and inv(16) (
In order to identify the genes which enable the accurate discrimination of these subgroups, we applied the data analysis methodology introduced by Golub et al. (13). We selected the minimal set of genes which provided optimal classification accuracy together with the highest prediction strength to avoid overfitting. Thirteen genes were sufficient to separate these AML subtypes with high precision (Table 24; Table 24 shows that a minimal set of 13 genes (GenBank accession numbers are given) is sufficient for accurate class prediction with optimal classification accuracy and highest prediction strength. Comparisons (A vs. B) were performed either between two distinct subtypes or between one distinct subtype and all other subtypes (=remainder), respectively. As calculated from pairwise comparisons, positive P(g,c) values indicate a higher expression in the first class listed, negative P(g,c) values a higher expression in the second class listed, respectively). GenBank accession numbers and detailed descriptions of the genes are given in Table 25 (Table 25: Thirty-six genes accurately separate three distinct cytogenetic AML subtypes. GenBank accession numbers, approved human gene nomenclature symbol (*=not approved) and description of the function are presented. Six genes are included in the minimal set of both weighted voting according to Golub et al. (13) (total=13) and multiple-tree classifiers (total=29)).
All 32 clinical samples could be assigned to their corresponding cytogenetic subtype with best accuracy in leave-one-out cross-validation (1.0). Prediction strength values ranged from 0.91 to 0.98 (Table 24). To illustrate these results we applied hierarchical clustering (31). The resulting dendrogram clearly demonstrates the capacity of this subset of genes to separate all AML cases according to their cytogenetic aberration (
For external validation, we tested whether primarily excluded samples could also be accurately assigned to their specific cytogenetic category. Despite their non-optimal cRNA quality, all 5 cases were correctly classified with high prediction strength (0.76, 1.00, 1.00, 1.00, 1.00).
As a second and independent methodological approach we developed a multiple-tree classifier to separate the three genetically defined subtypes based on the expression level of a minimal set of genes. In short, we computed classification trees to discriminate between the different AML subclasses. To avoid overfitting of a single tree model, we computed a multiple-tree model using an iteratively reduced set of genes. For each tree, we used only those genes that had not been used by the previously computed classification trees. The procedure is stopped when a predetermined number of trees has been reached. For this study, the optimal number of trees was calculated to be 15. The votes of the 15 trees were aggregated by a vote-by-majority rule. Equal votes for two of the three classes were counted as misclassification.
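The vote-by-majority rule, including the treatment of ties as misclassifications, can be sketched as follows. The function names and the use of `None` to mark a tie are assumptions made for this illustration.

```python
from collections import Counter

def aggregate_votes(tree_votes):
    """Return the majority class, or None when the two leading classes tie."""
    counts = Counter(tree_votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # equal votes for two classes
    return counts[0][0]

def accuracy(vote_lists, true_labels):
    """Share of samples whose aggregated vote matches the true class.

    A tie yields None, which never equals a label and is therefore
    counted as a misclassification.
    """
    correct = sum(aggregate_votes(v) == t for v, t in zip(vote_lists, true_labels))
    return correct / len(true_labels)
```

With 15 trees and three classes a tie remains possible, for example 7 votes each for two classes and 1 vote for the third, which is why the tie rule is needed.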
The classifier utilized the expression values of 29 genes (MYH11 was identified twice by two different probe sets; Table 25) to discriminate between three classes, namely samples displaying t(15;17), t(8;21), and inv(16) (
In summary, we identified 36 genes using two independent methodologies for class prediction in AML (Table 25). Six genes were described in both calculations, seven were found exclusively in the minimal set according to Golub et al. (13), and another 23 genes using multiple-tree classifiers.
We were able to demonstrate striking correlations between genotype and gene expression profiles in three genetically defined subgroups of AML. In addition, we answered the question, whether the cytogenetically identical AML with t(15;17) but appearing with two different phenotypes, AML M3 or AML M3v (
This is the first study to demonstrate an unequivocal association between disease-specific genetic alterations and distinct gene expression profiles. For each of the three analyzed clearly defined subtypes of AML (t(8;21), t(15;17), inv(16)) patterns of gene expression were identified that were homogeneous within all samples of the respective subgroups but clearly differed between these three subgroups. The analyzed samples represent disease subtypes that are specifically defined on the genetic and the phenotypic level by conventional diagnostics including cytomorphology, cytogenetics, and molecular genetics.
By applying two independent approaches for the analysis of microarray data, the present study demonstrates that AML samples from previously defined subtypes (3) can be classified adequately on the basis of gene expression profiles. It is intriguing that there is both sufficient coherence in gene expression within and difference between these subtypes to classify them with high accuracy even though the samples derive from the same myeloid cell lineage.
In order to correlate gene expression with cytogenetics, Virtaneva et al. compared the expression status of 6,606 genes of AML blasts with normal cytogenetics and trisomy 8 as the sole abnormality. While in this study normal CD34+ cells clustered into a distinct group, AML with trisomy 8 and AML with normal karyotype intercalated with each other. Microarray analyses showed an overall increased expression of genes located on chromosome 8, suggesting a gene-dosage effect (14). AML with trisomy 8 is heterogeneous on the phenotypic level as it occurs in different FAB subtypes. In contrast, AML with t(15;17), inv(16) and t(8;21) show a very close correlation to distinct morphological subtypes. Furthermore, trisomy 8 is probably not a primary, disease-defining aberration leading to AML as it also occurs in addition to a variety of different cytogenetic and molecular genetic abnormalities (32, 33). In contrast to this study, Armstrong et al. compared samples of the more homogeneous group of ALL with MLL translocations to ALL without MLL translocations and to AML (15). They demonstrated that ALL with MLL translocations comprises a distinct disease which can be classified robustly by gene expression profiling.
The main focus of the present analyses was the assessment of the differences between three highly characterized subgroups of AML defined by specific primary chromosome aberrations. As anticipated, it was shown that AML with t(8;21) and AML with inv(16), which both involve alterations of the core binding factor complex, are more closely related to each other than to AML with t(15;17) (34). Both phenotypically different subtypes of AML with t(15;17), AML M3 and AML M3v, cluster within one area. In an additional analysis, the latter two subtypes were also separated from each other based on their gene expression profiles. These data suggest the existence of further, not yet identified genetic alterations leading to the different phenotypes of AML M3 and AML M3v. One possible candidate gene is FLT3, which is mutated more frequently in AML M3v than in AML M3 (67% vs. 19%, P=0.001) (35).
Several studies confirmed that gene expression profiles can be used for class prediction. This has been shown for acute leukemias, round blue cell tumors, and malignant melanomas (13, 36-38) as well as for different types of solid tumors using multi-class cancer classification (39). While the selection of different subgroups in these studies was performed using exclusively phenotypic criteria, other studies were based on genetically defined entities (40, 41). In the present study not only the discrimination of the three genetically defined AML subgroups was accomplished but also all these cases of AML were separated from normal bone marrow (data not shown) (42).
To develop a classifier, two independent approaches were applied. While classification by weighted voting according to Golub et al. (13) allows the discrimination between the three classes based on a minimal set of 13 genes, the multiple-tree classifier utilizes 29 genes. As indicated by cross-validation, generalization properties are excellent for the multiple-tree classifier, i.e. it is likely to perform equally well on new, unseen samples. Furthermore, it can be easily extended to more than the three subclasses described in the present study.
Our classifiers contained genes already known to be primarily involved in the pathogenesis of the respective entities, namely MYH11(43) and ETO(44).
Presumably, the detection of overexpression of MYH11 in inv(16) cases and of ETO in t(8;21) cases relates to the detection of the fusion gene transcripts rather than of the wild type transcripts. The other genes identified belong to various functional categories. Their potential pathogenetic significance in AML has yet to be clarified.
It is expected that the extension of the present analyses to currently less well-defined AML will identify additional subgroups of AML with clinical relevance based on their gene expression profiles. The feasibility of such an approach has been demonstrated for the first time for diffuse large B-cell lymphoma (45). Alizadeh et al. have subdivided an entity previously considered homogeneous by various pathological methods into two subgroups that are not only new but also highly relevant prognostically. In two recent studies, gene expression profiling in breast cancer also revealed subgroups differing significantly in their prognosis (46, 47). With regard to AML, this approach may be most promising in AML with normal karyotype. This subgroup cannot be further defined on the cytogenetic level and is characterized by an intermediate prognosis, possibly masking poor and favorable subgroups.
In addition, the current data may have major implications with regard to delineating aberrant gene expression pathways underlying the pathogenesis of AML. As has been shown in mantle cell lymphoma and medulloblastoma (48, 49) the extension of our analyses to all subgroups of AML should enable us to define the deregulated genes important for the initiation and the progression of AML. Finally, these analyses will promote the identification of new targets for specific treatment approaches.
24. Evans, P., Jack, A., Short, M., Haynes, A., Shiach, C., Owen, R., Johnson, R. & Morgan, G. J. (1995) Leukemia 9, 1285-1286.
36. Miyazato, A., Ueno, S., Ohmine, K., Ueda, M., Yoshida, K., Yamashita, Y., Kaneko, T., Mori, M., Kirito, K., Toshima, M. et al. (2001) Blood 98, 422-427.
Introduction
The determination of the surface and cytoplasmic expression of characteristic proteins by flow cytometry (FC) is a common method applied to the diagnosis and the subclassification of acute myeloid leukemias (AML) (1). The oligonucleotide microarray analysis (MA) represents a novel technology for the simultaneous detection of the mRNA abundance of large numbers of genes (2, 3). Based on specific gene-expression patterns distinct disease entities have been identified (4-6). Therefore MA may become of major importance as a diagnostic tool for AML in the near future (7, 8). However, up to now data on the correlation between protein expression levels and mRNA abundance are limited (9-12). To analyze the relation of protein expression and mRNA abundance in AML we performed 450 individual comparisons of 29 genes in 25 patients with AML at diagnosis analyzed by FC and MA in parallel (13).
Methods
Samples
Bone marrow samples from highly characterized patients with newly diagnosed and untreated AML were used. Samples had been analyzed by cytomorphology, cytochemistry, cytogenetics and molecular genetics in all cases and were characterized by one of the balanced chromosomal aberrations t(8;21), t(15;17), or inv(16) and the respective molecular and morphologic features (7). The studies abide by the rules of the local Internal Review Board and the tenets of the revised Helsinki protocol.
Flow Cytometry
The studies were performed on cells isolated from bone marrow by Ficoll-Hypaque density gradient centrifugation as described previously (14). Applying triple-stainings and isotype controls, monoclonal antibodies against 29 antigens were used in the following combinations as designed for diagnostic purposes (conjugated with the fluorochromes FITC, PE, and PC-5, respectively): CD34/CD2/CD33, CD7/CD33/CD34, CD34/CD56/CD33, CD11b/CD33/CD34, CD64*/CD4/CD45, CD15*/CD13/CD33, HLA-DR/CD33/CD34, CD34/CD135/CD33, CD34/CD116/CD33, CD34/NG2/CD33, CD38/CD133**/CD34, CD61/CD14/CD45, CD36/CD235a/CD45, CD34/CD10/CD19, MPO***/LF***/cyCD15, TdT/cyCD22/cyCD3, TdT/cyCD79a/cyCD3. All antibodies were purchased from Immunotech (Marseilles, France), except for: *=Medarex (Annandale, N.J.); **=Miltenyi Biotec (Bergisch Gladbach, Germany); ***=Caltag (Burlingame, Calif.). The respective combinations of antibodies were added to 1×10⁶ cells (volume, 100 μl) and incubated for ten minutes at room temperature. The samples were then washed twice in phosphate-buffered saline (PBS) and resuspended in 0.5 ml PBS. FC analysis was performed using a FACSCalibur flow cytometer (Becton Dickinson, San Jose, Calif.). Analysis of list-mode files was performed by means of the CellQuest Pro Software (Becton Dickinson). Antigen expression was rated positive at a cut-off level of 20% of the cells within the mononuclear gate for membrane proteins and at a cut-off level of 10% for cytoplasmic antigens. Mean fluorescence intensity values were calculated for all events with fluorescence values higher than isotype controls.
Microarray Experiments
For microarray analysis the GeneChip® System (Affymetrix, Santa Clara, Calif.) was used. The targets for GeneChip® analysis were prepared according to the current Expression Analysis Technical Manual. Briefly, lysates of the leukemia samples were homogenized (QIAshredder, Qiagen, Hilden, Germany) and total RNA extracted (RNeasy Mini Kit, Qiagen). Normally, 10 μg total RNA isolated from 1×10⁷ cells was used as starting material in the subsequent cDNA synthesis using oligo[(dT)24T7promotor]65 primer (cDNA Synthesis System, Roche Diagnostics, Mannheim, Germany). The cDNA was purified by phenol:chloroform:isoamyl alcohol extraction (Ambion, Austin, Tex.) and acetate/ethanol precipitated overnight. For detection of the hybridized target nucleic acid, biotin-labeled ribonucleotides were incorporated during the in vitro transcription (Enzo® BioArray™ HighYield™ RNA Transcript Labeling Kit, ENZO, Farmingdale, USA). After quantification of the purified cRNA (RNeasy Mini Kit, Qiagen), 15 μg was fragmented by alkaline treatment (200 mM Tris-acetate, pH 8.2, 500 mM potassium acetate, 150 mM magnesium acetate) and added to the hybridization cocktail sufficient for 5 hybridizations on standard GeneChip® microarrays. Before hybridization onto U95Av2, Test3 microarrays (Affymetrix) were chosen for monitoring of labelling efficiency and the integrity of the cRNA. Washing and staining of the probe arrays were performed according to the current protocols (Micro_1v1, EukGE-WS2v4). The Affymetrix software (Microarray Suite, Version 4.0.1) extracted fluorescence intensities from each element on the microarrays as detected by confocal laser scanning according to the manufacturer's recommendations. In order to be able to compare different experiments the global microarray intensities were scaled to a common target intensity. Furthermore, the mRNA abundance of the genes was qualitatively rated as a) present, b) marginal, and c) absent calls, respectively.
Statistics
A total of 29 genes were analyzed in 25 patients with AML. The congruence of positivity and negativity of the expression of the respective genes as determined by FC and MA was analyzed for each gene in each individual patient. Comparisons of microarray intensities were performed by Mann-Whitney U-test. Analyses for bivariate correlations of mRNA and protein expression levels were performed by Pearson's correlation using SPSS, Version 10.0.7.
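The two statistics named above can be illustrated with a short sketch. It is not the SPSS procedure used in the study: the function names are hypothetical, only the correlation coefficient and the U statistic are computed, and the p-values reported in the text are not reproduced.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs with a > b count 1, ties count 0.5."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in sample_a
        for b in sample_b
    )
```

For example, `pearson_r` could be applied to per-patient protein expression levels (FC) and mRNA intensities (MA) for one gene, and `mann_whitney_u` to microarray intensities of two groups of cases.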
Results and Discussion
Twenty-five cases of AML were analyzed in parallel by FC and MA for the expression of 29 genes. Seven had AML M2 with t(8;21), 5 had AML M3 with t(15;17), 7 had AML M3v with t(15;17), and 6 had AML M4Eo with inv(16). A total of 450 comparisons of individual expression data obtained by both methods were performed. Of these, 399 (88.7%) revealed congruent results for protein expression and mRNA abundance (230 cases (51.1%) with positive expression and 169 cases (37.6%) with negative expression, respectively; table 26). In 30 comparisons (6.7%) MA detected positivity for mRNA expression (call: present) while the results of FC indicated negativity. In 21 cases (4.7%) protein expression was demonstrated by FC while no mRNA expression was detected by MA (call: absent).
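The percentages above follow directly from the four comparison counts. A minimal sketch that recomputes them (the function and key names are illustrative and not part of the original analysis):

```python
def congruence_summary(both_positive, both_negative, ma_only, fc_only):
    """Summarize FC versus MA comparisons as counts and percentages."""
    total = both_positive + both_negative + ma_only + fc_only

    def pct(n):
        return round(100.0 * n / total, 1)

    congruent = both_positive + both_negative
    return {
        "total": total,
        "congruent": congruent,
        "congruent_pct": pct(congruent),
        "both_positive_pct": pct(both_positive),
        "both_negative_pct": pct(both_negative),
        "ma_only_pct": pct(ma_only),
        "fc_only_pct": pct(fc_only),
    }


# Counts taken from the text: 230 congruent positive, 169 congruent negative,
# 30 positive by MA only, 21 positive by FC only.
summary = congruence_summary(230, 169, 30, 21)
```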
Focussing on the genes most specific for the diagnosis of AML, i.e. myeloperoxidase, CD13, and CD33, a high correlation between protein expression and mRNA abundance was observed (congruence in 73 of 75 comparisons (97%)). In detail, all cases were rated positive for expression of myeloperoxidase and all but one were positive for both CD13 and CD33, respectively, by both methods. Furthermore, for most other genes essential for the subclassification of AML as well as for the distinction of AML from acute lymphoblastic leukemia and chronic leukemias the results obtained by both methods were always congruent (i.e., for CD10, CD22, CD7, CD133, CD116, CD11b, CD61, CD45, HLA-DR, NG2) or were congruent in the majority (117/140, 84%) of cases (CD79a, CD19, CD2, CD3, CD15, Lactoferrin, CD14, CD235a, CD135, CD34; Table 26).
Furthermore, the agreement between protein expression and mRNA abundance was not limited to congruence in positivity; the two were also significantly correlated quantitatively. To prove this, the protein expression levels and mRNA abundance were compared by Pearson's correlation in genes expressed in the majority of the analyzed cases. These comparisons revealed significant correlations for the fluorescence intensities as assessed by FC and MA for CD13 (p=0.001), CD33 (p=0.034), CD34 (p=0.003), CD45 (p=0.015), CD15 (p=0.016), and CD7 (p=0.033) and thus further underline the high coherence of expression patterns for both protein and mRNA.
Thirty comparisons displayed mRNA expression and no protein expression. Due to the ongoing process of maturation (CD14, CD15) and due to the cross-lineage expression of the genes (CD3, CD19), the levels of mRNA abundance may have been too low to result in detectable protein expression levels using the described cut-off levels of 20% and 10%, respectively. This suggestion is supported by a quantitative analysis of mRNA expression data, which shows relatively low albeit positive levels for the respective cases and genes (mean average fluorescence intensity, 46.7±54.5 in cases positive for CD14, CD15, CD3, or CD19 versus 389.4±831.0 in all positive cases, Mann-Whitney U-test: p<0.001), while at the same time protein expression amounts to a mean of 5±4%.
Twenty-one comparisons displayed positivity by FC and negativity of MA, which comprise 4.7% of all individual comparisons performed. These discrepancies most probably are due to: a) erythrocytic debris positive for CD36 interfering with the acquisition of CD36 negative cells during flow cytometric analysis; b) differences between both methods in the selected DNA sequences and antigen epitopes, respectively, detected (i.e. CD38, CD4, CD56); and c) differences in the stability of mRNA and protein of the respective genes.
Overall, these results demonstrate for the first time that there is a significant correlation between protein expression and gene expression in AML and that the antigens so far identified as essential for the diagnosis and subclassification of AML by flow cytometry may represent additional candidate genes when using MA as a diagnostic tool for molecular cancer class prediction (15, 16). Furthermore, it is anticipated that the present analyses represent a prime example and will be reproduced for a variety of other entities such as lymphoid malignancies. Due to their high potential to assess the expression patterns of high numbers of genes and due to their excellent reproducibility, microarrays are a promising future diagnostic tool. As a consequence, they may replace the more time- and resource-consuming diagnostic methods currently used for diagnosing leukemias, such as cytomorphology, cytogenetics, and FC.
Since their introduction, microarrays have been promising tools for basic research. With regard to leukemia, the pivotal discrimination of unselected acute lymphoblastic (ALL), and acute myeloid leukemia (AML) samples based on their gene expression signatures inspired numerous studies (Golub et al., 1999). We performed gene expression analyses to designate candidate genes for discriminating specific AML samples from normal bone marrow (BM) of healthy volunteers. With regard to the classification of hematological malignancies according to the WHO, distinct AML subtypes have been established based on genetic abnormalities of the leukemic blasts. Here, we demonstrate gene expression analyses of 8 healthy BM donors and 45 leukemia patients representing four cytogenetic subtypes of AML: t(8;21)(q22;q22), inv(16)(p13q22), t(15;17)(q22;q12), and t(11q23)/MLL. Combining different approaches for data analysis a minimal set of genes was identified to designate a reliable class prediction model. Based on the expression pattern of 39 genes, cytogenetically defined AML subtypes could accurately be predicted and separated from healthy BM. Taken together, gene expression signatures of AML cases with recurrent genetic abnormalities demonstrate a very close correlation between genotype and gene expression. Therefore, introducing a set of candidate genes, expression profiling may serve for diagnosis of AML subtypes defined by the new WHO classification.
We analyzed BM aspirates from 8 healthy volunteers and the following 45 untreated AML patients:
Microarray experiments. Gene expression analyses were performed from cells remaining from the diagnostic sample. They had immediately been lysed, frozen and were stored at −80° C. from 1 to 34 months until preparation for gene expression profiling. The targets for U95Av2 microarrays were prepared according to current protocols (Affymetrix). Before expression profiling, Test3 Probe Arrays were chosen for monitoring the integrity of the cRNA.
AML samples were thoroughly characterized by a combination of cytomorphology, cytogenetics, FISH, RT-PCR and quantitative real-time PCR.
For data analysis we combined different approaches. First, a reduced subset of 200 genes obtained by permutation-based neighborhood analysis (SAM, Tusher et al., 2001) was visualized for corresponding clusters using principal component analysis (J-Express, Dysvik et al., 2001).
Next, we adapted the signal-to-noise/weighted voting algorithm (Golub et al., 1999) to identify discriminative genes. A minimal set of 39 genes, which provided both optimal classification accuracy and highest prediction strength, was selected to avoid overfitting. The significance of each gene was tested by permutation-based neighborhood analysis. The robustness of the classifier was assessed by leave-one-out crossvalidation. These expression signatures were sufficient to distinguish AML samples with high accuracies from normal bone marrow and to predict the recurrent chromosome aberration, respectively (Table 27).
A set of 39 genes is sufficient for class prediction. Accuracy denotes the rate of correctly classified test samples. P(g,c) indicates the signal-to-noise ratio of gene x: Sx=(μ1−μ2)/(σ1+σ2), where μk and σk denote the mean expression and standard deviation of gene x in group k. As calculated from pairwise comparisons (class A vs. B), positive P(g,c) values indicate a higher gene expression in class A, negative P(g,c) values a higher gene expression in class B, respectively. HGNC symbols are given in column 1.
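The signal-to-noise ratio defined above can be computed as in the following sketch. The function names and the input layout are hypothetical, and the use of the sample standard deviation (n−1 denominator) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """Sx = (mu1 - mu2) / (sigma1 + sigma2) for one gene.

    class_a, class_b: expression values of the gene in class A and class B.
    Assumption: sample standard deviation; each class needs >= 2 values.
    """
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


def rank_genes(expression, top_n):
    """Rank genes by absolute signal-to-noise ratio.

    expression: dict mapping a gene identifier to a pair
    (values in class A, values in class B).
    Returns the top_n (gene, ratio) pairs, most discriminative first.
    """
    scored = [(gene, signal_to_noise(a, b)) for gene, (a, b) in expression.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:top_n]
```

A positive ratio indicates higher expression in class A and a negative ratio higher expression in class B, matching the sign convention given in the text.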
All leukemia samples could accurately be assigned to their corresponding cytogenetic subtype with 100% accuracies. To illustrate these results, a hierarchical clustering is shown.
Here, we demonstrate gene expression analyses of 9 healthy BM donors and 271 leukemia patients representing:
AML: 4 distinct cytogenetic subtypes t(8;21)(q22;q22) (AML t(8;21)), inv(16)(p13q22) (AML inv(16)), t(15;17)(q22;q12) (AML t(15;17)), and t(11q23)/MLL (AML MLL). In addition, we analyzed AML samples characterized by normal karyotypes (AML normal), complex aberrant karyotypes (AML complex), trisomy 8 as sole aberration (AML+8), and other chromosomal changes (AML other).
ALL: 3 distinct genetically defined subtypes: t(4;11)(q21;q23) (ALL t(4;11)), t(8;14)(q24;q32) (ALL t(8;14)), t(9;22)(q34;q11) (ALL Ph) and 2 subtypes defined by their immunophenotype: ALL of the B-lineage not carrying the t(9;22) (ALL B not Ph) and T-ALL (T-ALL)
CLL: 5 genetically defined subtypes: trisomy 12 (tri 12), deletion 11q (11q-), deletion 13q (13q-), deletion 17p (17p-) and none of these aberrations (normal)
CML (CML) without any further subdivision and
Normal bone marrow from healthy volunteers (normal BM).
We used the Affymetrix oligonucleotide microarray technology (GeneChip® Instrument System) to obtain gene expression profiles of each individual clinical sample of interest. The commercially available HG-U133 probe arrays gave information about the relative mRNA abundance of about 33,000 human genes which are represented on these high-density DNA-oligonucleotide microarrays.
Chip Information (as provided by manufacturer):
The GeneChip® Human Genome U133 Set (HG-U133A and HG-U133B) is comprised of two microarrays containing over 1,000,000 unique oligonucleotide features covering more than 39,000 transcript variants, which in turn represent greater than 33,000 of the best characterized human genes. This powerful set allows to reproducibly examine the quantitative and qualitative expression of most genes in the human genome, and was designed using the recently published and publicly available draft of the human genome sequence. Sequences used in the design of the array were selected from GenBank, dbEST, and RefSeq. Sequence clusters were created from Build 133 of UniGene (Apr. 20, 2001) and refined by analysis and comparison with a number of other publicly available databases including the Washington University EST trace repository and the University of California, Santa Cruz golden-path human genome database (April 2001 release). In addition, ESTs were analyzed for untrimmed low-quality sequence information, correct orientation, false priming, false clustering, alternative splicing and alternative polyadenylation.
Combining different approaches for data analysis, a set of genes was identified to designate a reliable class prediction model. Based on the expression pattern of those genes, defined leukemia types and subtypes could accurately be predicted and separated from healthy BM. Taken together, gene expression signatures demonstrate a very close correlation between genotype and gene expression.
Therefore, introducing a set of candidate genes, measurement of mRNA abundances by gene expression profiling serves for diagnosis of leukemia types and subtypes.
We analyzed BM aspirates from 9 healthy volunteers and the following 280 leukemia patients:
Acute myeloid leukemia (AML)
t(8;21)(q22;q22)/AML M2 (n=13),
t(15;17)(q22;q12)/AML M3/M3v (n=20),
inv(16)(p13q22)/AML M4eo (n=12),
t(11q23)/MLL-aberrations (n=15)
trisomy 8 (n=10)
normal karyotype (n=62)
complex aberrant karyotype (n=36)
other aberrations (n=5)
Acute lymphoblastic leukemia (ALL)
t(4;11)(q21;q23) (n=9)
t(8;14)(q24;q32) (n=4)
t(9;22)(q34;q11) (ALL Ph) (n=15)
ALL B lineage without t(9;22) (ALL B not Ph) (n=9)
T-ALL (n=9)
Chronic lymphocytic leukemia (CLL)
trisomy 12 (tri 12) (n=5)
deletion 11q (11q-) (n=4)
deletion 13q (13q-) (n=10)
deletion 17p (17p-) (n=4)
none of these aberrations (normal) (n=9)
Chronic myeloid leukemia (n=14)
Normal bone marrow (normal BM) (n=9)
We selected bone marrow (BM) samples from 271 leukemia patients at diagnosis representing 18 different disease entities or subentities and from 9 healthy volunteers, respectively. All cases were sent for reference diagnostics to our laboratory, registered in our leukemia database and were treated within prospective randomized multi-center trials. The studies abide by the rules of the local internal review board and the tenets of the revised Helsinki protocol. Samples were received either locally or by overnight mail. Diagnosis was performed by an individual combination of cytomorphology, cytogenetics, FISH, immunophenotyping and molecular genetics. Mononuclear cells were isolated by a Ficoll gradient, lysed, frozen and were stored at −80° C. from one to 34 months until sample preparation for gene expression analysis. All leukemia samples were thoroughly characterized by an individual combination of cytomorphology, cytogenetics, immunophenotyping, fluorescence in situ hybridisation (FISH), and polymerase chain reaction based methods, both qualitative RT-PCR and quantitative real-time PCR. Using FISH analysis, more than 90% of cells demonstrated the specific signal constellation. The respective fusion transcripts BCR-ABL in t(9;22) positive CML (Schoch et al. 2002a) and in t(9;22) positive ALL, AML1-ETO in AML with t(8;21), CBFbeta-MYH11 in AML with inv(16), PML-RARalpha in AML with t(15;17) (Schoch et al. 2002b) and various MLL-fusion partners in both AML and ALL with t(11q23) were detected by FISH and PCR techniques in all samples.
In t(8;14) positive ALL the IGH-C-MYC rearrangement was confirmed by FISH. In all cases with AML and complex aberrant karyotype 24 color FISH was performed in addition to chromosome banding analysis (Schoch et al. 2002c).
Genetic subtyping of CLL was carried out using interphase FISH with the following probes (Buhmann et al. 2002):
Microarray analyses were performed utilising the GeneChip® System (Affymetrix, Santa Clara, USA). The targets for GeneChip® analyses were prepared according to the current Expression Analysis Technical Manual. Briefly, lysates of the leukemia samples were homogenised (QIAshredder, Qiagen, Hilden, Germany) and total RNA was extracted (RNeasy Mini Kit, Qiagen). Normally, 5 μg total RNA isolated from 1×10⁷ cells were used as starting material in the subsequent cDNA synthesis using oligo[(dT)24T7promotor]65 primer (cDNA Synthesis System, Roche Applied Science, Mannheim, Germany). The cDNA was purified by phenol:chloroform:isoamyl alcohol (25:24:1) extraction (Ambion, Austin, USA) and acetate/ethanol precipitated overnight. For detection of the hybridised target nucleic acid, biotin-labeled ribonucleotides were incorporated during the in vitro transcription (Enzo® BioArray™ HighYield™ RNA Transcript Labeling Kit, ENZO, Farmingdale, USA). After quantification of the purified cRNA (RNeasy Mini Kit, Qiagen), 15 μg labeled cRNA were fragmented by alkaline treatment (200 mM Tris-acetate, pH 8.2, 500 mM potassium acetate, 150 mM magnesium acetate) and added to the hybridisation cocktail sufficient for 5 hybridisations on standard format GeneChip® microarrays. Before hybridisation to HG-U133 microarrays, Test3 microarrays (Affymetrix) were used in some cases to monitor the integrity of the cRNA. Washing and staining of the probe arrays was performed according to the current protocols of the manufacturer (Fluidics Station, Micro_1v1, EukGE-WS2v4). The Affymetrix software (Microarray Suite, Version 5.0) extracted fluorescence intensities from each feature on the microarrays as detected by confocal laser scanning according to the manufacturer's recommendations. Some of the hybridisation cocktails had previously been hybridised to U95Av2 arrays. Hybridisation cocktails can be used for up to 5 distinct array analyses.
All hybridisation cocktails demonstrated high quality cRNA characteristics. We considered both low 3′/5′ ratio (e.g., lower than about 3) of housekeeping controls and the total number of present called genes (> about 30% on U133A), along with the average signal intensity of a present called gene. Expression profiles which fulfilled all quality control criteria were selected for subsequent supervised selection of informative genes.
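The quality control criteria above can be expressed as a simple filter. This is a sketch under stated assumptions: the function name is hypothetical, the approximate thresholds from the text (3′/5′ ratio lower than about 3, more than about 30% present calls) are used as defaults, and the average signal intensity criterion is omitted because no threshold is given for it.

```python
def passes_quality_control(ratio_3_to_5, present_call_fraction,
                           max_ratio=3.0, min_present_fraction=0.30):
    """Return True if an expression profile meets both stated criteria.

    ratio_3_to_5: 3'/5' signal ratio of the housekeeping controls.
    present_call_fraction: fraction of probe sets called present (0..1).
    The default thresholds are the approximate values quoted in the text.
    """
    return (ratio_3_to_5 < max_ratio
            and present_call_fraction > min_present_fraction)


# Hypothetical profiles, for illustration only.
profiles = {
    "sample_1": (1.4, 0.42),
    "sample_2": (3.8, 0.41),
    "sample_3": (1.2, 0.22),
}
accepted = [name for name, (ratio, present) in profiles.items()
            if passes_quality_control(ratio, present)]
```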
For data analysis we combined different approaches. First, the expression data were preprocessed. Raw expression intensities were scaled using the Affymetrix Microarray Suite software scaling parameter (target intensity: 5000). This preprocessing is based on a mask file which compares expression intensities of a set of 100 genes which code for ubiquitous housekeeping cellular proteins. This set of genes for normalisation of expression intensities is represented on both U133A and U133B arrays. The step of data preprocessing ensures that array experiments can be compared properly using further statistical algorithms and methods. Subsequently, the data were analyzed according to two different established methods as described below. The results from the two analyses were systematically compared to validate the list of differentially expressed genes.
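The idea of scaling each array to a common target intensity using a set of normalisation genes can be sketched as follows. This is a simplified illustration and not the Microarray Suite implementation: the function name and data layout are hypothetical, and a plain arithmetic mean over the normalisation genes is assumed, whereas the actual software may, for example, trim outliers.

```python
def scale_to_target(intensities, normalisation_genes, target=5000.0):
    """Scale one array so that its normalisation genes average to `target`.

    intensities: dict mapping probe set identifier to raw intensity.
    normalisation_genes: identifiers of the housekeeping genes used as reference.
    Assumption: the reference level is the arithmetic mean of those genes.
    """
    reference = sum(intensities[g] for g in normalisation_genes) / len(normalisation_genes)
    if reference <= 0:
        raise ValueError("reference intensity must be positive")
    factor = target / reference
    return {gene: value * factor for gene, value in intensities.items()}


# Hypothetical array with two reference genes and one gene of interest.
raw = {"hk_1": 100.0, "hk_2": 300.0, "gene_x": 50.0}
scaled = scale_to_target(raw, ["hk_1", "hk_2"])
```

After scaling, intensities from different arrays share the same reference level and can be compared directly.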
1. Selection of Differentially Expressed Genes
a) Analysis According to Example 3.
The top 20 differentially expressed genes were calculated for all disease entities and normal bone marrow, respectively, as described in example 3. Expression data were analyzed in order to select a minimal set of discriminative genes, which provides, as described hereinabove (Example 3), maximum classification accuracy in leave-one-out-crossvalidation.
One-versus-all (OVA) and all-pairs comparisons (AP) were systematically applied. Genes were ranked according to signal-to-noise ratio (STN). For each OVA and AP comparison a set of discriminative genes is disclosed in tables 29, 32, 35, 38 and 41 whereby the gene names can be found in tables 43a,b. The most discriminative and informative genes are marked by asterisks in tables 29, 32, 35, 38 and 41. Classification accuracy was estimated by means of leave-one-out-crossvalidation and weighted voting.
A set of 20 top-ranked genes, which provided both optimal classification accuracy and highest prediction strength, was selected to avoid overfitting. The significance of each gene was tested by permutation-based neighborhood analysis. The robustness of the classifier was assessed by leave-one-out crossvalidation. These expression signatures were sufficient to distinguish leukemia samples with high accuracies from normal bone marrow and also to predict the recurrent chromosome aberration, respectively (Tables 29, 32, 35, 38, 41). Accuracy denotes the rate of correctly classified test samples. P(g,c) indicates the signal-to-noise ratio of gene x: Sx=(μ1−μ2)/(σ1+σ2), where μk and σk denote the mean expression and standard deviation of gene x in group k. As calculated from pairwise comparisons (class A vs. B), positive P(g,c) values indicate a higher gene expression in class A, negative P(g,c) values a higher gene expression in class B, respectively.
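A weighted voting classifier with leave-one-out crossvalidation, in the style of Golub et al. (1999), can be sketched as below for a single pairwise comparison (class A vs. B). This is an illustrative reading of the procedure, not the code used for the reported results: all names are hypothetical, genes are reselected inside each crossvalidation fold, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev


def train(samples, labels, n_genes):
    """Select the n_genes with the largest |signal-to-noise| and keep, per gene,
    its index, weight (signal-to-noise ratio) and decision boundary (midpoint of class means)."""
    class_a = [s for s, l in zip(samples, labels) if l == "A"]
    class_b = [s for s, l in zip(samples, labels) if l == "B"]
    model = []
    for g in range(len(samples[0])):
        xa = [s[g] for s in class_a]
        xb = [s[g] for s in class_b]
        noise = stdev(xa) + stdev(xb)
        if noise == 0:
            continue  # gene carries no usable variance estimate
        weight = (mean(xa) - mean(xb)) / noise
        boundary = (mean(xa) + mean(xb)) / 2
        model.append((g, weight, boundary))
    model.sort(key=lambda item: abs(item[1]), reverse=True)
    return model[:n_genes]


def predict(model, sample):
    """Each gene votes weight * (x - boundary); positive votes favour class A."""
    votes_a = votes_b = 0.0
    for g, weight, boundary in model:
        vote = weight * (sample[g] - boundary)
        if vote > 0:
            votes_a += vote
        else:
            votes_b -= vote
    total = votes_a + votes_b
    strength = abs(votes_a - votes_b) / total if total else 0.0
    return ("A" if votes_a >= votes_b else "B"), strength


def loocv_accuracy(samples, labels, n_genes):
    """Leave each sample out once, retrain, and count correct predictions."""
    correct = 0
    for i in range(len(samples)):
        model = train(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:], n_genes)
        predicted, _ = predict(model, samples[i])
        correct += predicted == labels[i]
    return correct / len(samples)
```

Each class needs at least three samples here, so that two remain for the standard deviation when one sample is held out.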
b) Analysis According to Westfall & Young
The same data set was analysed according to Westfall & Young to identify significantly differentially expressed genes with adjustment of the p-values for multiple testing.
Step-down maxT and minP multiple testing procedures were applied. These procedures compute permutation-adjusted p-values and provide strong control of the family-wise Type I error rate (FWER). The multtest package (version 1.0) from Bioconductor, which is based on the R statistical language, was used. These methods outperform other methods (see Dudoit, JASA 2002).
Package multtest (version 1.0)
from Bioconductor http://www.bioconductor.org
R statistical language: http://www.r-project.org/
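The step-down maxT adjustment can be illustrated with a short two-class sketch. It is not the multtest package and does not reproduce its interface: the function names are hypothetical, a Welch t-statistic is assumed as the per-gene test statistic, and random label permutations with a fixed seed stand in for whatever permutation scheme was actually used.

```python
import random
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic; returns 0.0 when both groups have zero variance."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def maxt_adjusted_p(data, labels, n_perm=1000, seed=0):
    """Permutation-adjusted p-values by the step-down maxT procedure.

    data: one list of expression values per gene (same sample order as labels).
    labels: 0/1 class label per sample.
    """
    def abs_stats(lab):
        out = []
        for row in data:
            a = [v for v, l in zip(row, lab) if l == 0]
            b = [v for v, l in zip(row, lab) if l == 1]
            out.append(abs(welch_t(a, b)))
        return out

    observed = abs_stats(labels)
    # Genes ordered from most to least extreme observed statistic.
    order = sorted(range(len(data)), key=lambda j: observed[j], reverse=True)
    counts = [0] * len(data)
    rng = random.Random(seed)
    permuted = list(labels)
    for _ in range(n_perm):
        rng.shuffle(permuted)
        stats = abs_stats(permuted)
        running_max = 0.0
        # Successive maxima, starting from the least extreme gene.
        for rank in range(len(order) - 1, -1, -1):
            running_max = max(running_max, stats[order[rank]])
            if running_max >= observed[order[rank]]:
                counts[rank] += 1
    adjusted = [0.0] * len(data)
    previous = 0.0
    for rank, j in enumerate(order):
        previous = max(previous, counts[rank] / n_perm)  # enforce monotonicity
        adjusted[j] = previous
    return adjusted
```

Because each gene's statistic is compared against the maximum over all genes ranked at or below it, the adjusted p-values account for the number of genes tested.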
c) Comparison of Gene Lists
The lists of differentially expressed genes obtained from 1a) and 1b) were systematically compared using PERL scripts in order to identify genes that occurred in both lists, versus genes occurring in one list only.
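The comparison of the two gene lists amounts to set intersection and set difference. The original analysis used PERL scripts that are not reproduced in this specification; the following Python sketch only illustrates the logic, with a hypothetical function name and, apart from 201497_x_at, made-up identifiers.

```python
def compare_gene_lists(list_a, list_b):
    """Split probe set identifiers into those found in both lists
    and those found in only one of them."""
    set_a, set_b = set(list_a), set(list_b)
    return {
        "both": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }


# Identifiers other than 201497_x_at are hypothetical placeholders.
signal_to_noise_list = ["201497_x_at", "probe_0001_at", "probe_0002_at"]
westfall_young_list = ["201497_x_at", "probe_0002_at", "probe_0003_at"]
comparison = compare_gene_lists(signal_to_noise_list, westfall_young_list)
```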
Expression intensities (expression levels) derived from the above-mentioned Microarray Suite program were plotted as bar graphs showing gene expression profiles using a PERL script (FIGS. 24 to 464).
Sensitivities for the detection of leukemia types and subtypes were calculated as the number of positive samples predicted divided by the number of true positives.
Specificities for the detection of leukemia types and subtypes were calculated as the number of negative samples predicted divided by the number of true negatives.
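For orientation, the conventional one-versus-rest definitions of sensitivity, TP/(TP+FN), and specificity, TN/(TN+FP), can be computed as in the sketch below. This uses the textbook definitions as an assumption; the per-subgroup percentages reported in the tables of this specification may have been derived in a somewhat different way. The function name and example labels are illustrative.

```python
def sensitivity_specificity(true_labels, predicted_labels, target):
    """Conventional sensitivity and specificity for one class versus the rest."""
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == target:
            if predicted == target:
                tp += 1
            else:
                fn += 1
        elif predicted == target:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Made-up labels, for illustration only.
truth = ["CML", "CML", "CLL", "CLL", "CLL"]
predicted = ["CML", "CLL", "CLL", "CLL", "CML"]
cml_sensitivity, cml_specificity = sensitivity_specificity(truth, predicted, "CML")
```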
Here we analyzed in total 14 distinct leukemia types and subtypes as well as a cohort of healthy volunteers for normal bone marrow characteristics. We applied the described two different statistical methods for identification of genes which allow accurate class assignments to the respective groups.
ALL t(4;11) (n=9)
ALL t(8;14) (n=4)
ALL B not Ph (n=9)
ALL Ph (n=15)
T-ALL (n=9)
AML+8 (n=10)
AML complex (n=36)
AML normal (n=62)
AML t(8;21) (n=13)
AML t(15;17) (n=20)
AML inv(16) (n=12)
AML MLL (n=15)
CLL (n=32)
CML (n=14)
normal BM (n=9)
total: 269 samples
First, expression data were analyzed according to example 3, as described hereinabove.
A set of 20 top-ranked genes, which provided both optimal classification accuracy and highest prediction strength for all pairwise (all pairs) and one-versus-all comparisons, is given in table 29. Within this set of genes, optimal classification accuracy can be obtained with genes marked by asterisks. Gene expression intensities, plotted as bar graphs, are given in FIGS. 24 to 188. Genes are identified by their unique Affymetrix identifier (for example 201497_x_at) and, where available, by approved HGNC symbols (HUGO Gene Nomenclature Committee). The complete annotation and sequence information for this set of genes is listed in tables 43a,b.
In total 269 cases with leukemia or normal bone marrow (BM) were analyzed. 248 of 269 (92.2%) cases were assigned to the correct leukemia type in all pairwise comparisons (table 28 b). The sensitivity indicated for each subgroup indicates the percentage of cases of the specific subgroup identified correctly in all pairwise comparisons (range 60% to 100%). The specificity indicates for each subgroup the percentage of correct assignments to this subgroup (range 85.3% to 100%).
In total 3766 individual assignments of leukemia and normal bone marrow were analyzed. 3745 of 3766 assignments (99.4%) were correct (table 28c). The sensitivity indicated for each subgroup indicates the percentage of correct assignments for cases of the specific subgroup in pairwise comparisons. (range 97.1% to 100%). The specificity indicates for each subgroup the percentage of correct assignments to this subgroup (range 98.4% to 100%).
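The totals of individual assignments are consistent with each case taking part in one pairwise comparison against every other group, that is, cases × (groups − 1). This reading is inferred from the reported numbers rather than stated explicitly; the sketch below simply checks the arithmetic, and the function name is illustrative.

```python
def n_pairwise_assignments(n_cases, n_groups):
    """Number of individual assignments if every case is compared
    once against each of the other groups."""
    return n_cases * (n_groups - 1)


# Totals reported in this specification:
all_types = n_pairwise_assignments(269, 15)      # 14 leukemia groups + normal BM
all_subtypes = n_pairwise_assignments(46, 5)     # 5 ALL subtypes
aml_subtypes = n_pairwise_assignments(173, 8)    # 8 AML subtypes
cll_subtypes = n_pairwise_assignments(32, 5)     # 5 CLL subtypes
overall_accuracy = round(100.0 * 3745 / all_types, 1)
```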
In a second approach, significant genes were identified according to Westfall & Young. Table 30 lists all genes found to be significant after p-value adjustment. Genes are identified by their unique Affymetrix identifier (for example 201497_x_at) and, where available, by approved HGNC symbols (HUGO Gene Nomenclature Committee). The complete annotation and sequence information for this set of genes is listed in tables 43a,b.
Furthermore, we provide information about genes which were found to be rated significant independently by both methodologies (Table 30). Top-significant genes according to the method of example 3 are marked by asterisks. Genes which were included in any of the top-20 lists are marked by positive signs.
In addition, selected gene profiles were chosen to demonstrate their capability of discriminating different leukemia types, subtypes and normal bone marrow, respectively. Gene expression profiles were generated by means of PERL programs, evaluated and plotted as bar graphs. Each of the analyzed groups is outlined accordingly. The following genes were selected and are given as FIGS. 189 to 233:
Generally, chromosomal aberrations are strongly associated with morphological characteristics. However, there are two chromosomal aberrations which are observed in both myeloid and lymphatic neoplasms, i.e. t(11q23)/MLL and the t(9;22). The t(9;22) occurs in ALL (ALL Ph) and CML, and t(11q23)/MLL is observed in ALL (ALL t(4;11)) and AML (AML MLL), respectively. Analysing gene expression signatures of both t(9;22) positive ALL and CML we identified genes, which allowed correct lineage assignments (table 29). In addition, our results indicate that the distinct expression signatures are also sufficient for correct assignments of the t(11q23)/MLL positive leukemias either to ALL or to AML (table 29). Thus, in both scenarios lineage assignment (lymphoid or myeloid), and even subtype classification can be accomplished based on the methods and markers described herein, despite of the fact that e.g., in the above-noted t(11q23) and t(9;22) chromosomal aberrations, the same chromosomal aberration is associated with different kinds of leukemia.
Here we analyzed in total 5 distinct ALL subtypes. We applied the described two different statistical methods for identification of genes which allow accurate class assignments to the respective groups.
First, expression data were analyzed according to example 3, as described hereinabove.
A set of 20 top-ranked genes, which provided both optimal classification accuracy and highest prediction strength for all pairwise (all pairs) and one-versus-all comparisons, is given in table 32. Within this set of genes, optimal classification accuracy can be obtained with genes marked by asterisks. Gene expression intensities, plotted as bar graphs, are given in FIGS. 234 to 252. Genes are identified by their unique Affymetrix identifier (for example 201497_x_at) and, where available, by approved HGNC symbols (HUGO Gene Nomenclature Committee). The complete annotation and sequence information for this set of genes is listed in tables 43a,b.
In total 46 cases of ALL were analyzed. 44 of 46 cases (95.7%) were assigned to the correct ALL subtype in all pairwise comparisons (table 31a). The sensitivity indicated for each subgroup indicates the percentage of cases of the specific subgroup identified correctly in all pairwise comparisons (range 88.9% to 100%). The specificity indicates for each subgroup the percentage of correct assignments to this subgroup (range 88.9% to 100%).
In total 184 individual assignments of ALL were analyzed. 182 of 184 assignments (98.9%) were correct (table 31b). The sensitivity indicated for each subgroup indicates the percentage of correct assignments for cases of the specific subgroup in pairwise comparisons. (range 97.2% to 100%). The specificity indicates for each subgroup the percentage of correct assignments to this subgroup (range 97.2% to 100%).
In a second approach, significant genes were identified according to Westfall & Young. Table 33 lists all genes found to be significant after p-value adjustment. Genes are identified by their unique Affymetrix identifier (for example 201497_x_at) and, where available, by approved HGNC symbols (HUGO Gene Nomenclature Committee). The complete annotation and sequence information for this set of genes is listed in tables 43a,b.
Furthermore, we provide information about genes which were found to be rated significant independently by both methodologies (Table 33). Top-significant genes according to the method of example 3 hereinabove are marked by asterisks. Genes which were included in any of the top-20 lists are marked by positive signs.
In addition, selected gene profiles were chosen to demonstrate their capability of discriminating different leukemia types, subtypes and normal bone marrow, respectively. Gene expression profiles were generated by means of PERL programs, evaluated and plotted as bar graphs. Each of the analyzed groups is outlined accordingly. The following genes were selected and are given as FIGS. 253 to 271:
Here we analyzed in total 8 distinct AML subtypes. We applied the described two different statistical methods for identification of genes which allow accurate class assignments to the respective groups.
First, expression data were analyzed according to example 3 as described hereinabove.
A set of 20 top-ranked genes, which provided both optimal classification accuracy and highest prediction strength for all pairwise (all pairs) and one-versus-all comparisons, is given in table 35. Within this set of genes, optimal classification accuracy can be obtained with genes marked by asterisks. Gene expression intensities, plotted as bar graphs, are given in FIGS. 272 to 336. Genes are identified by their unique Affymetrix identifier (for example 201497_x_at) and, where available, by approved HGNC symbols (HUGO Gene Nomenclature Committee). The complete annotation and sequence information for this set of genes is listed in tables 43a,b.
In total 173 cases of AML were analyzed. 160 of 173 cases (92.5%) were assigned to the correct AML subtype in all pairwise comparisons (table 34a). The sensitivity indicated for each subgroup indicates the percentage of cases of the specific subgroup identified correctly in all pairwise comparisons (range 60% to 100%). The specificity indicates for each subgroup the percentage of correct assignments to this subgroup (range 85.5% to 100%).
In total 1211 individual assignments of AML were analyzed. 1198 of 1211 assignments (98.9%) were correct (table 34b). The sensitivity indicated for each subgroup indicates the percentage of correct assignments for cases of the specific subgroup in pairwise comparisons (range 94.3% to 100%). The specificity indicates for each subgroup the percentage of correct assignments to this subgroup (range 97.7% to 100%).
In a second approach, significant genes were identified according to Westfall & Young. Table 36 lists all genes found to be significant after p-value adjustment. Genes are identified by their unique Affymetrix identifier (for example 201497_x_at) and, where available, by approved HGNC symbols (HUGO Gene Nomenclature Committee). The complete annotation and sequence information for this set of genes is listed in tables 43a,b.
Furthermore, we provide information about genes which were found to be rated significant independently by both methodologies (Table 36). Top-significant genes according to the method of example 3 are marked by asterisks. Genes which were included in any of the top-20 lists are marked by positive signs.
In addition, selected gene profiles were chosen to demonstrate their capability of discriminating different leukemia types, subtypes and normal bone marrow, respectively. Gene expression profiles were generated by means of PERL programs, evaluated and plotted as bar graphs. Each of the analyzed groups is outlined accordingly. The following genes were selected and are given as FIGS. 337 to 370:
Here we analyzed in total 5 genetically defined CLL subtypes. We applied the described two different statistical methods for identification of genes which allow accurate class assignments to the respective groups.
First, expression data were analyzed according to example 3 as described hereinabove.
A set of 20 top-ranked genes, which provided both optimal classification accuracy and highest prediction strength for all pairwise (all pairs) and one-versus-all comparisons, is given in table 38. Within this set of genes, optimal classification accuracy can be obtained with genes marked by asterisks. Gene expression intensities, plotted as bar graphs, are given in FIGS. 371 to 404. Genes are identified by their unique Affymetrix identifier (for example 201497_x_at) and, where available, by approved HGNC symbols (HUGO Gene Nomenclature Committee). The complete annotation and sequence information for this set of genes is listed in tables 43a,b.
In total, 32 cases of CLL were analyzed. 31 of 32 cases (96.9%) were assigned to the correct CLL subtype in all pairwise comparisons (Table 37a). The sensitivity given for each subgroup indicates the percentage of cases of that subgroup identified correctly in all pairwise comparisons (range 90% to 100%). The specificity given for each subgroup indicates the percentage of assignments to that subgroup that were correct (range 90% to 100%).
In total, 128 individual assignments of CLL were analyzed. 127 of 128 assignments (99.2%) were correct (Table 37b). The sensitivity given for each subgroup indicates the percentage of correct assignments for cases of that subgroup in pairwise comparisons (range 97.5% to 100%). The specificity given for each subgroup indicates the percentage of assignments to that subgroup that were correct (range 97.3% to 100%).
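The sensitivity and specificity figures above follow the definitions given in the text: sensitivity is the share of a subgroup's cases that were assigned correctly, and specificity is the share of all assignments to a subgroup that were correct. A minimal Python sketch of that calculation is given below. The function names and the toy labels "A" and "B" are hypothetical and serve only as illustration; they are not taken from Tables 37a, b.

```python
from collections import Counter


def per_subgroup_rates(true_labels, assigned_labels):
    """Return {subgroup: (sensitivity %, specificity %)} as defined in the text."""
    n_cases = Counter(true_labels)          # cases that truly belong to each subgroup
    n_assigned = Counter(assigned_labels)   # assignments made to each subgroup
    n_correct = Counter(
        t for t, a in zip(true_labels, assigned_labels) if t == a
    )
    rates = {}
    for subgroup in n_cases:
        sensitivity = 100.0 * n_correct[subgroup] / n_cases[subgroup]
        specificity = (
            100.0 * n_correct[subgroup] / n_assigned[subgroup]
            if n_assigned[subgroup]
            else float("nan")  # no assignment was made to this subgroup
        )
        rates[subgroup] = (sensitivity, specificity)
    return rates


def percent_correct(n_correct, n_total):
    """Overall percentage of correct assignments, rounded to one decimal."""
    return round(100.0 * n_correct / n_total, 1)


# Hypothetical toy data: one "A" case is wrongly assigned to "B".
true_labels = ["A"] * 10 + ["B"] * 10
assigned_labels = ["A"] * 9 + ["B"] * 11
print(per_subgroup_rates(true_labels, assigned_labels))

# Overall figures quoted in the text.
print(percent_correct(31, 32))    # 96.9
print(percent_correct(127, 128))  # 99.2
```

Note that the quantity called specificity here is computed over the assignments made to a subgroup, as stated in the text, and not over the cases outside that subgroup.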
In a second approach, significant genes were identified according to Westfall & Young. Table 39 lists all genes found to be significant after p-value adjustment. Genes are identified by their unique Affymetrix identifier (for example 201497_x_at) and, where available, by their approved HGNC (HUGO Gene Nomenclature Committee) symbol. The complete annotation and sequence information for this set of genes is listed in more detail in Tables 43a, b.
Furthermore, we provide information about genes which were rated significant independently by both methodologies (Table 39). Top-significant genes according to the method of example 3 are marked by asterisks. Genes which were included in any of the top-20 lists are marked by plus signs (+).
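The text does not state which variant of the Westfall & Young procedure was used. As an illustration only, the following Python sketch shows a single-step maxT permutation adjustment for a two-group comparison, which is one common form of the approach. The choice of the Welch t-statistic, the function names, the probe names and the expression values are all assumptions made for this example and do not reproduce the analysis behind Table 39.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Absolute Welch t-statistic for two lists of expression values."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def maxt_adjusted_pvalues(expr, labels, n_perm=200, seed=0):
    """Single-step maxT adjusted p-values.

    expr:   {probe: [expression value per sample]}
    labels: list of 0/1 group labels, one per sample (at least 2 per group)
    """
    rng = random.Random(seed)

    def all_stats(lab):
        out = {}
        for probe, values in expr.items():
            a = [v for v, g in zip(values, lab) if g == 0]
            b = [v for v, g in zip(values, lab) if g == 1]
            out[probe] = welch_t(a, b)
        return out

    observed = all_stats(labels)
    exceed = {probe: 0 for probe in expr}
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute group labels across samples
        max_stat = max(all_stats(shuffled).values())
        for probe, t_obs in observed.items():
            # Count permutations whose largest statistic over all probes
            # reaches the observed statistic of this probe.
            if max_stat >= t_obs:
                exceed[probe] += 1
    # The +1 terms count the observed labelling itself as one permutation.
    return {probe: (exceed[probe] + 1) / (n_perm + 1) for probe in expr}


# Hypothetical data: 6 samples per group, one separated probe, one noise probe.
labels = [0] * 6 + [1] * 6
expr = {
    "probe_diff": [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 5.0, 5.2, 4.9, 5.1, 4.8, 5.0],
    "probe_noise": [2.0, 2.3, 1.9, 2.2, 2.1, 2.4, 2.2, 2.0, 2.3, 1.8, 2.1, 2.5],
}
print(maxt_adjusted_pvalues(expr, labels))
```

Comparing every probe against the permutation distribution of the maximum statistic controls the family-wise error rate while taking the correlation between probes into account, which is the main motivation for this kind of adjustment in expression data.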
Here we analyzed a total of 4 major leukemia types as well as a cohort of healthy volunteers for normal bone marrow characteristics. We applied the two different statistical methods described above to identify genes which allow accurate class assignments to the respective groups.
ALL (n=47)
AML (n=175)
CLL (n=35)
CML (n=14)
Normal bone marrow (n=9)
First, expression data were analyzed according to example 3 as described hereinabove.
A set of 20 top-ranked genes, which provided both optimal classification accuracy and highest prediction strength for all pairwise (all pairs) and one-versus-all comparisons, is given in Table 41. Within this set of genes, optimal classification accuracy can be obtained with the genes marked by asterisks. Gene expression intensities, plotted as bar graphs, are given in FIGS. 405 to 431. Genes are identified by their unique Affymetrix identifier (for example 201497_x_at) and, where available, by their approved HGNC (HUGO Gene Nomenclature Committee) symbol. The complete annotation and sequence information for this set of genes is listed in more detail in Tables 43a, b.
In total, 280 cases of leukemia and normal bone marrow (BM) were analyzed. 263 of 280 cases (93.9%) were assigned to the correct leukemia subtype or normal bone marrow in all pairwise comparisons (Table 40a). The sensitivity given for each subgroup indicates the percentage of cases of that subgroup identified correctly in all pairwise comparisons (range 76.6% to 98.3%). The specificity given for each subgroup indicates the percentage of assignments to that subgroup that were correct (range 88.9% to 97.1%).
In total, 1120 individual assignments of leukemia subtype or normal bone marrow were analyzed. 1103 of 1120 assignments (98.5%) were correct (Table 40b). The sensitivity given for each subgroup indicates the percentage of correct assignments for cases of that subgroup in pairwise comparisons (range 94.2% to 99.3%). The specificity given for each subgroup indicates the percentage of assignments to that subgroup that were correct (range 97.2% to 99.3%).
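The case and assignment totals quoted above can be cross-checked against the group sizes listed for this example. The short Python sketch below does so under one assumption that the text does not state explicitly: each case receives one individual assignment for every pairwise comparison that involves its own group, i.e. four assignments per case when five groups are compared. The function name is hypothetical.

```python
def assignment_totals(group_sizes):
    """Return (number of cases, number of individual pairwise assignments)."""
    n_cases = sum(group_sizes.values())
    # Assumption: one assignment per case for each comparison of its own
    # group against one of the other groups.
    comparisons_per_case = len(group_sizes) - 1
    return n_cases, n_cases * comparisons_per_case


# Group sizes as listed in the text.
cohort = {"ALL": 47, "AML": 175, "CLL": 35, "CML": 14, "Normal bone marrow": 9}

n_cases, n_assignments = assignment_totals(cohort)
print(n_cases, n_assignments)                      # 280 1120
print(round(100.0 * 263 / n_cases, 1))             # 93.9
print(round(100.0 * 1103 / n_assignments, 1))      # 98.5
```

The same relation is consistent with the CLL subtype analysis, where 32 cases in 5 subtypes correspond to 128 individual assignments.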
In a second approach, significant genes were identified according to Westfall & Young. Table 42 lists all genes found to be significant after p-value adjustment. Genes are identified by their unique Affymetrix identifier (for example 201497_x_at) and, where available, by their approved HGNC (HUGO Gene Nomenclature Committee) symbol. The complete annotation and sequence information for this set of genes is listed in more detail in Tables 43a, b.
Furthermore, we provide information about genes which were rated significant independently by both methodologies (Table 42). Top-significant genes according to the method of example 3 are marked by asterisks. Genes which were included in any of the top-20 lists are marked by plus signs (+).
In addition, selected gene profiles were chosen to demonstrate their capability of discriminating between different leukemia types, subtypes and normal bone marrow, respectively. Gene expression profiles were generated by means of Perl programs, evaluated and plotted as bar graphs. Each of the analyzed groups is outlined accordingly. The following genes were selected and are given as FIGS. 432 to 464:
Tables 43a, b: functional gene annotation for genes identified to be differentially expressed between different types of leukemia, or between healthy bone marrow and leukemia, respectively.
As described by the GeneChip manufacturer, for each probe set (for example 200093_s_at_HG-U133A), a GenBank or RefSeq accession number was chosen to represent the target sequence. Using this accession number, the UniGene cluster (in the current release) containing that accession number was identified. If the UniGene record contained a link to LocusLink, annotations were retrieved from LocusLink. These annotations include gene symbol, location, OMIM, EC, Gene Ontology (GO), description and RefSeq sequence accession. The RefSeq accession was linked to the protein annotations, which include domain identification (Pfam and BLOCKS), similarity search (blastp nr) and family classification (SCOP, EC and GPCR HMM searches).
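The annotation chain described above (probe set, accession number, UniGene cluster, LocusLink record) can be pictured as a sequence of table lookups. The following Python sketch illustrates this with in-memory dictionaries. The function name, the table names and all identifiers other than the probe set name are hypothetical placeholders; the sketch does not use the manufacturer's files or any real database interface.

```python
def annotate_probeset(probeset_id, tables):
    """Follow probe set -> accession -> UniGene -> LocusLink, where links exist."""
    accession = tables["probeset_to_accession"].get(probeset_id)
    if accession is None:
        return None  # unknown probe set
    cluster = tables["accession_to_unigene"].get(accession)
    locus = tables["unigene_to_locuslink"].get(cluster) if cluster else None
    # Annotations are only retrieved if the UniGene record links to LocusLink.
    annotation = dict(tables["locuslink_annotations"].get(locus, {})) if locus else {}
    return {
        "probeset": probeset_id,
        "accession": accession,
        "unigene": cluster,
        "locuslink": locus,
        "annotation": annotation,
    }


# Placeholder tables; every value below is invented for illustration.
tables = {
    "probeset_to_accession": {"200093_s_at_HG-U133A": "ACCESSION_PLACEHOLDER"},
    "accession_to_unigene": {"ACCESSION_PLACEHOLDER": "UNIGENE_PLACEHOLDER"},
    "unigene_to_locuslink": {"UNIGENE_PLACEHOLDER": "LOCUS_PLACEHOLDER"},
    "locuslink_annotations": {
        "LOCUS_PLACEHOLDER": {"symbol": "SYMBOL_PLACEHOLDER", "location": "LOCATION_PLACEHOLDER"}
    },
}

print(annotate_probeset("200093_s_at_HG-U133A", tables))
```

Returning an empty annotation when a link is missing mirrors the conditional step in the text, where LocusLink annotations are retrieved only if the UniGene record provides the link.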
Target sequence information for all the probes which were identified as being able to distinguish between different types and subtypes of leukemia and normal bone marrow, respectively, is given in Table 44.
As further described by the GeneChip manufacturer, the HG-U133 Target Databank is a compilation of probe set annotations and target sequence information for all the probes represented on the HG-U133 A and B arrays. Target sequences are the relatively short (typically around 300-600 bp) sequences against which probes have been designed on a GeneChip® array. These target sequences can be thought of as a subsequence of the Consensus/Exemplar sequence.
The Consensus/Exemplar sequences (i.e., the coding or full cDNA sequences corresponding to the markers described herein as being able to distinguish between different types and subtypes of leukemia and normal bone marrow) for most markers are given in Table 45.
The expression pattern of genes allowed precise class assignments of defined leukemia types and subtypes according to the WHO classification of hematological malignancies, and normal BM, respectively.
Thus, we introduce candidate genes suitable for diagnosis of leukemia types and subtypes based on gene expression profiling.
These data demonstrate the utility of gene expression profiling for the discrimination of all major leukemia entities and most subentities. In total, up to 14 different leukemia types and subtypes could clearly be distinguished from each other and from normal BM, respectively. These leukemias are associated with highly differing prognoses and require specific treatment strategies. Because these analyses are performed on a single platform requiring only basic molecular biological methods, this approach provides broad access to high-quality diagnosis of leukemia.
Number | Date | Country | Kind |
---|---|---|---|
01126244.1 | Nov 2001 | EP | regional |
02009758.0 | Apr 2002 | EP | regional |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP02/12303 | 11/4/2002 | WO | | 11/19/2004 |