Method of predicting cancer

TECHNICAL FIELD

The present invention relates to a method for predicting cancer and a drug design method. Particularly, the present invention relates to a cancer prediction method useful in genetic diagnosis for evaluating the malignancy of cancer. The invention also relates to a drug design method utilizing the results of the above prediction method.

BACKGROUND ART

Various solid cancers such as breast cancer and colon cancer have different grades of malignancy depending on the individual case. As the various degrees of malignancy of individual cases require different methods of treatment, predicting prognosis is extremely important. Currently, cancer prognosis is performed by, e.g., image analysis such as CT and X-ray, pathologic analysis such as tissue typing, and analysis utilizing a tumor marker. For example, CEA is well known as a molecular tumor marker for breast and colon cancers. This marker is not quite satisfactory for cancer diagnosis, however, because of its low sensitivity for early cancer and because in many cases detection is possible only after the cancer has reached an advanced stage. In addition, various methods of predicting cancer malignancy have been developed, but they only provide partial correlation with malignancy and their prediction results have not been satisfactory.

Recently, thanks to technologies such as DNA chips, it has become possible to systematically analyze the expression patterns of genes. As a result, it looks more likely than ever that cancer malignancy can be predicted on the basis of gene expression patterns.

On the other hand, it has been revealed that cancer is a disease caused by genetic abnormalities. In the field of clinical medicine, attention is being focused on genetic diagnosis of cancer based on a search for the responsible genes and detection of their abnormalities. Such genetic diagnosis of cancer is in great demand as a means of predicting the risks resulting from cancer, so that cancer can be prevented or treated in early stages.

DISCLOSURE OF THE INVENTION

The object of the invention is to provide a method for predicting cancer and a drug design method.

The present inventors, after extensive work with a view to achieving the above objective, have succeeded in predicting cancer based on the result of multivariate analysis of expression levels of genes obtained from a primary cancerous lesion and thus have succeeded in completing the invention.

The invention provides a method for classifying cancer which comprises the steps of:

(a) collecting genes from specimens and measuring an expression level of the genes;

(b) selecting at least one of the measured genes;

(c) subjecting the expression levels of the selected genes to multivariate analysis; and

(d) classifying the specimens into groups with similar gene expression patterns by using the result of multivariate analysis as an indicator.

The present invention also provides a method for predicting cancer which comprises the steps of:

(a) collecting genes from specimens and measuring an expression level of the genes;

(b) selecting at least one of the measured genes;

(c) subjecting the expression levels of the selected genes to multivariate analysis;

(d) classifying the specimens into groups with similar gene expression patterns by using the result of multivariate analysis as an indicator; and

(e) predicting the state of cancer based on the result of classification.

The prediction method may include steps of determining an expression pattern characteristic of a particular state of cancer and comparing it with the expression patterns of genes collected from a cancer specimen on which cancer prediction is to be performed.

The states of cancer include at least one selected from the group consisting of the presence or absence of cancer, malignancy of cancer, presence or absence of metastasis of cancer, and presence or absence of recurrence of cancer. Metastasis of cancer includes lymph node metastasis, and recurrence includes early recurrence.

Examples of the selected genes include those of gene group I containing nucleotide sequences 1-27 from Table 1, those of gene group II containing nucleotide sequences 28-153 of Table 2, and those of gene group III containing nucleotide sequences 154-289 of Table 3. The selected genes may also include combinations of at least one gene selected from the group consisting of gene group I containing nucleotide sequences 1-27 of Table 1, gene group II containing nucleotide sequences 28-153 of Table 2, and gene group III containing nucleotide sequences 154-289 of Table 3, and at least one gene other than those of gene groups I, II and III.

One example of specimen classification employs a hormone receptor-positive group and/or a hormone receptor-negative group as an indicator. One example of the hormone receptor is estrogen receptor.

Examples of cancer include breast cancer, stomach cancer, esophageal cancer, oral cancer, colon cancer, rectal cancer, anal cancer, pancreatic cancer, lung cancer, renal cancer, bladder cancer, ovarian cancer, uterine cancer, skin cancer, melanoma, central nervous tumor, peripheral nervous tumor, gum cancer, pharyngeal cancer, maxillary and jowl cancer, liver cancer, prostate cancer, leukemia, multiple myeloma, and malignant lymphoma. Particularly, breast cancer and colon cancer are preferable.

Multivariate analysis can be performed by cluster analysis.

The present invention further provides a drug design method, comprising designing a drug for suppressing the expression of a gene that is expressed in a specimen whose state of cancer has been predicted to be at high risk by the above prediction method. Examples of such genes include genes having nucleotide sequences 4, 7 and 20 of Table 1, genes having nucleotide sequences 28, 29, 31, 32, 35, 43, 49-53, 67, 70, 72, 73, 75-79, 81, 84, 86-92, 94-99, 104-111, 113, 114, 117, and 122-153 of Table 2, and genes having nucleotide sequences 155, 162, 163, 167-169, 171, 172, 174, 175, 177-180, 188, 190, 193, 198, 211, 222, 242-253, 255-257, 259-261, 263 and 265 of Table 3, or combinations thereof. One example of the drug for suppressing the expression of the above-mentioned gene is an antisense nucleic acid.

The present invention further provides a drug design method, comprising designing a drug for enhancing the expression of a gene that is expressed in a specimen whose state of cancer has been predicted to be at high risk by the above prediction method. Such genes include genes having nucleotide sequences 1, 2, 3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 and 21 of Table 1, genes having nucleotide sequences 30, 33, 34, 36-42, 44-48, 54-66, 68, 69, 71, 74, 80, 82, 83, 85, 93, 100-103, 112, 115, 116 and 118-121 of Table 2, and genes having nucleotide sequences 154, 156-161, 164-166, 170, 173, 176, 181-187, 189, 191, 192, 194-197, 199-210, 212-221, 223-241, 254, 258, 262, 264 and 266-289 of Table 3, or combinations thereof. One example of the drug for enhancing the expression of any of the above genes is a targeting vector in which the particular gene has been incorporated.

The invention also provides a program for having a computer function as a cancer state predicting system comprising a means of analyzing the expression level of a gene collected from a primary cancerous lesion and a means of identifying the state of the cancer by using the result of analysis as an indicator.

The present invention further provides a computer-readable recording medium in which is stored a program for having a computer function as a cancer state predicting system comprising a means of analyzing the expression level of a gene collected from a primary cancerous lesion and a means of identifying the state of the cancer by using the result of analysis as an indicator.

The present invention will be hereafter described in detail. This application claims priority from Japanese Patent Application Nos. 2001-73063 filed on Mar. 14, 2001, 2001-108503 filed on Apr. 6, 2001, and 2001-234807 filed on Aug. 2, 2001. The present specification includes part or all of the contents as disclosed in the specification and/or drawings of the above applications.

The method of the present invention is characterized in that specimens are classified into several groups according to differences in expression patterns of particular genes, wherein an expression pattern characteristic of a state of cancer is determined based on the result of classification. The method of the invention is summarized in FIG. 1. First, a number of specimens, both normal and cancerous, are collected (see FIG. 1(e)), and the expression level of genes deriving from a primary cancerous lesion in these specimens is measured (see FIG. 1(f)). The measurement of the gene expression levels in these specimens is performed for all of the genes selected by, for example, searching the literature (see FIG. 1(c)). Next, genes useful for multivariate analysis are selected from the genes for which the expression level has been measured. The selected genes are subjected to data analysis such as multivariate analysis (see FIG. 1(g)), and the specimens are classified into a small number of groups of similar expression patterns. The number of indicators for the classification into the small number of groups (i.e. the number of the classified groups) is not more than 20, preferably not more than 10, and more preferably 2. For example, the number is two when the groups consist of a hormone receptor-positive group and a hormone receptor-negative group (however, there may be cases where a further group is created in which positive and negative specimens are mixed). Based on the result of analysis, an expression pattern characteristic of a particular state of cancer is determined (see FIG. 1(h)). The state of cancer can be predicted by comparing the expression pattern in a specimen whose state of cancer is to be predicted with the patterns of the classified groups. It is also possible to determine, based on the results of classification, the malignancy of cancer or whether or not metastasis is occurring. Thereafter, a gene specific to a different state of cancer, such as malignancy, can be determined by using the result of expression pattern analysis in the cancer state predicting method, and a drug can be designed for controlling the expression of the gene or the activity of a gene product.
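The comparison of a new specimen's expression pattern with the patterns of the classified groups may be illustrated by the following minimal sketch. It assumes, purely for illustration, that each group is summarized by its mean expression profile and that similarity is measured by Euclidean distance; the specification does not prescribe either choice. The function names, group labels and expression values are hypothetical.

```python
from math import sqrt

def centroid(profiles):
    """Mean expression profile of a group (one value per gene)."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_group(profile, groups):
    """Return the label of the group whose mean profile is closest to `profile`."""
    centroids = {label: centroid(members) for label, members in groups.items()}
    return min(centroids, key=lambda label: distance(profile, centroids[label]))

# Hypothetical expression levels of three genes in already-classified specimens.
groups = {
    "ER+": [[5.0, 1.0, 3.0], [6.0, 2.0, 3.0]],
    "ER-": [[1.0, 6.0, 3.0], [2.0, 5.0, 4.0]],
}
new_specimen = [5.5, 1.5, 3.2]
print(predict_group(new_specimen, groups))
```

Other similarity measures (for example, correlation-based distances) could be substituted without changing the overall structure.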

1. Quantification of Gene Expression

In order to determine gene expression level, RNA is isolated from a specimen. Isolation of a gene may be performed by any known method. Examples of the method include a method by which cDNA is synthesized from an RNA prepared by a guanidine-isothiocyanate method. Examples of the gene to be isolated and determined include a gene derived from a primary cancerous lesion and a gene encoding an immunoglobulin, and many other genes thought to be relevant to cancer prediction may be selected by searching the literature.

Gene expression data may be obtained by any desired method, such as competitive PCR, TaqMan PCR, and Northern blot technique.

(1) Competitive PCR

Competitive PCR is a method for determining gene expression levels by amplifying identical genes contained in a plurality of samples in the same reaction system. One example of the competitive PCR method is an adaptor-tagged competitive PCR method (see FIG. 2). In this method, a different kind of adaptor sequence is added to each one of the identical cDNAs contained in at least two kinds of samples. The cDNAs are amplified after the individual samples containing cDNAs to which these adaptor sequences were added are mixed, and then quantitative ratios of the amplified cDNAs are determined (see Japanese Patent No. 2905192). In the following, the adaptor-tagged competitive PCR method is briefly described.

Initially, at least two kinds of samples containing the cDNA to be determined are prepared (two kinds of samples are taken as an example for simplicity). After cleaving the cDNAs in the samples with a specific restriction enzyme, an adaptor is added to the cleavage site. The adaptor refers to an oligonucleotide designed such that it can discriminate different cDNAs with the different oligonucleotides when amplified. The adaptors are designed as double-stranded such that they can bind to the restriction enzyme cleavage site of the cDNA. The adaptors may be designed such that the length of the adaptor added to the cDNA in one sample is different from that of the adaptor tagged to the cDNA in the other sample. Alternatively, the adaptors may be designed such that at least one restriction enzyme-recognition site is contained in both the adaptor added to the cDNA in one sample and the adaptor added to the cDNA in the other sample. Further alternatively, the adaptors may be designed such that the adaptor added to the cDNA in one sample is different in nucleotide sequence from that added to the cDNA in the other sample (examples A and B are shown in FIG. 2). These adaptors may be chemically synthesized. They may also be labeled by a fluorescent label or a radioisotope.

The samples each containing the adaptor-tagged cDNA are mixed (preferably in equal amounts). Then, amplification is performed using the cDNAs in these samples as templates, by polymerase chain reaction (PCR), for example. After amplification, the amplified products are detected using an automated sequencer (from e.g. Pharmacia) or an image scanner (from e.g. Molecular Dynamics). In the case where a radioisotope has been used, the detection is carried out using a densitometer or the like. As shown at the bottom of FIG. 2, the cDNAs can be quantitatively determined based on differences in the signal level from the labels in the sequences to which different adaptors were added.
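The quantitative determination described above may be sketched as a simple ratio of label signals. This sketch assumes that the two samples were mixed in equal amounts and that both adaptor-tagged forms of a cDNA amplify with equal efficiency, so that the signal ratio reflects the ratio of the original amounts. The gene names and signal values are hypothetical.

```python
def expression_ratio(signal_a, signal_b):
    """Relative amount of a cDNA in sample A versus sample B, from label signals."""
    if signal_b <= 0:
        raise ValueError("signal_b must be positive")
    return signal_a / signal_b

# Hypothetical peak signals for the same cDNA carrying adaptor A and adaptor B.
peaks = {"gene_x": (1200.0, 400.0), "gene_y": (300.0, 600.0)}
ratios = {gene: expression_ratio(a, b) for gene, (a, b) in peaks.items()}
print(ratios)
```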

(2) TaqMan PCR Method

TaqMan PCR is a method whereby amplification reaction and fluorescence intensity are measured simultaneously in a mixed reaction system (reaction tube) of a template, primer and labeled probe, so that fluorescent reporter dye released from a specific probe hybridized to the template is detected in real time and the PCR products are automatically analyzed by a computer connected to the detector (also called a real-time PCR method). This real-time detection PCR method is known, and apparatuses and kits for the method are commercially available. Thus, the present invention can employ such commercially available apparatuses or kits to detect gene expression (examples of the kits include TaqMan PCR kit and TaqMan EZ RT-PCR kit from ABI).

(3) Northern Blotting

Northern blotting is a method for analyzing the size or amount of gene transcription products (mRNA) being expressed in a cell. Total RNA or mRNA extracted from the cell is subjected to denaturing agarose gel electrophoresis, transferred onto a nylon or nitrocellulose membrane and fixed on the membrane. By hybridizing the membrane with a probe for the target gene, the size and amount of the mRNA of the gene are analyzed.

Apparatuses and kits for performing Northern blotting are also commercially available. Examples include the Message Maker reagent set and a fully automatic electrophoresis blotting system (from Labimap).

(4) Detection by PCR Method

Primers for the detection of the above-mentioned gene, that is, a forward primer (sense primer) and a reverse primer (anti-sense primer) for PCR, are designed and synthesized based on the nucleotide sequence of the gene so that, taking into account the amplification efficiency of PCR, the size of the amplified fragment is about 50 to 200 bp. The reverse primer is designed such that it is complementary to the base sequence. The primers may be designed by selecting a plurality of desired sequences from one or more different kinds of sequences taken from the above-mentioned base sequences.
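The design constraints above (a reverse primer complementary to the base sequence, and an amplified fragment of about 50 to 200 bp) may be checked with a sketch such as the following. The template and primer sequences are hypothetical, exact matching is assumed for simplicity, and the helper names are not taken from any primer design software.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def amplicon_length(template, forward, reverse):
    """Length of the fragment amplified by the primer pair, or None if a site is missing."""
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    # The reverse primer anneals where its reverse complement occurs on the sense strand.
    site = reverse_complement(reverse)
    end = template.find(site, start + len(forward))
    if end == -1:
        return None
    return end + len(site) - start

def size_ok(length, low=50, high=200):
    """True if the amplicon length lies within the preferred 50 to 200 bp range."""
    return length is not None and low <= length <= high

# Hypothetical 80-nt template with primer sites at both ends.
template = "ATGGCCATTG" + "ACGT" * 15 + "GGATCCTTAA"
forward = "ATGGCCATTG"
reverse = reverse_complement("GGATCCTTAA")
length = amplicon_length(template, forward, reverse)
print(length, size_ok(length))
```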

The above primers may be chemically synthesized in a conventional manner, such as by using a DNA automatic synthesizer from Applied Biosystems (the same applies to nucleotide synthesis below). In the case of adaptor-tagged competitive PCR, only the reverse primer needs to be designed toward the poly (A) from the adaptor-tagged site.

(5) Probe

The probes used in the present invention may comprise an oligonucleotide labeled by binding a fluorescent reporter dye and a fluorescent quencher dye thereto.

The oligonucleotide portion of the gene detection probe may be designed on the basis of all or part of the sequence of the gene used in the present invention. Further, the oligonucleotide can be used that is capable of hybridizing to all or part of the nucleotide sequence of the gene under stringent conditions and that has a sequence of at least 15 contiguous nucleotides.

“Stringent conditions” refer to conditions where, in the case of using the TaqMan probe in real-time PCR, the probe and the primers simultaneously associate or hybridize with the template DNA. More specifically, the conditions include the use of a conventional buffer solution at temperatures of 60 to 65° C. Accordingly, the probe used in the present invention may have a mutation such as deletion, substitution or addition in one or more (e.g. from 1 to 10) nucleotides, as long as the probe can hybridize to the DNA to be detected under the above-mentioned stringent conditions. Further, the probe sequence may have approximately 1-10% mismatches to the nucleotide sequence of the region to be hybridized, as long as it can hybridize under the stringent conditions.
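The mismatch tolerance mentioned above may be screened with a sketch such as the following. It counts substitutions only, between a probe and the equal-length sequence it would match perfectly, and compares the fraction with a 10% limit. It does not model hybridization at 60 to 65° C, so passing this check is only a necessary screen under these assumptions, not evidence that a probe hybridizes. The sequences are hypothetical.

```python
def mismatch_fraction(probe, target):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(probe) != len(target):
        raise ValueError("probe and target must have the same length")
    mismatches = sum(1 for p, t in zip(probe.upper(), target.upper()) if p != t)
    return mismatches / len(probe)

def within_tolerance(probe, target, max_fraction=0.10):
    """True if the mismatch fraction does not exceed the assumed 10% limit."""
    return mismatch_fraction(probe, target) <= max_fraction

# Hypothetical 20-nt target region and two candidate probe sequences.
target = "ACGTACGTACGTACGTACGT"
probe_one_mismatch = "ACGTACGTACCTACGTACGT"    # differs at one position
probe_four_mismatches = "TCGTTCGTTCGTTCGTACGT"  # differs at four positions
print(within_tolerance(probe_one_mismatch, target))
print(within_tolerance(probe_four_mismatches, target))
```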

As a result of fluorescence resonance energy transfer, the fluorescence intensity of the above fluorescent reporter dye is suppressed when it is bound to the same probe as that to which the fluorescent quencher dye is bound. The intensity is not suppressed when the fluorescent reporter dye is not bound to the same probe as the fluorescent quencher dye. The fluorescent reporter dye may preferably be of the fluorescein type, such as FAM (6-carboxy-fluorescein). The fluorescent quencher dye may preferably be of the rhodamine type, such as TAMRA (6-carboxy-tetramethyl-rhodamine). These fluorescent dyes are known and readily available. The binding sites of the fluorescent reporter dye and fluorescent quencher dye are not particularly limited. Typically, the fluorescent reporter dye binds to one end (preferably the 5′-end) of the oligonucleotide of the probe, and the fluorescent quencher dye binds to the other end.

2. Selection of the Gene

From among the genes whose expression levels were measured as described above, genes useful for the multivariate analysis to be described later are selected. “Useful genes” refer to those genes that are selected from among the genes for which expression levels have been measured above and by which specimens can be discriminated or classified according to differences in the expression level when multivariate analysis is performed as described below. In the present invention, initially, genes that are to be used for quantitative determination of expression for the purpose of predicting prognosis, for example, are selected. The genes used for the quantitative determination of expression are ones that are useful in classifying cancer specimens and that satisfy predetermined criteria, and are selected depending on the type of cancer that is to be predicted. In the present invention, the types of genes used for predicting prognosis, for example, are not particularly limited as long as they are expressed in a primary cancerous lesion. Types of cancer include breast cancer, stomach cancer, esophageal cancer, oral cancer, colon cancer, rectal cancer, anal cancer, pancreatic cancer, lung cancer, renal cancer, bladder cancer, ovarian cancer, uterine cancer, skin cancer, melanoma, central nervous tumor, peripheral nervous tumor, gum cancer, pharyngeal cancer, maxillary and jowl cancer, liver cancer, prostate cancer, leukemia, multiple myeloma, and malignant lymphoma. A gene expressed in at least one type of cancer selected from the above group can be used. The method for selecting the gene varies depending on the type of cancer. For example, the method includes selection by: the expression of hormone receptor; the result of other cluster analyses; presence or absence of lymph node metastasis; presence or absence of recurrence; prognostic factors; and/or tissue type. An example of metastasis is lymph node metastasis. An example of recurrence is early recurrence, which is a systemic recurrence within two years after an operation. Thus, by selecting genes useful in classifying tumor tissue and performing multivariate analysis, tumor tissue can be classified into groups according to characteristics of cancer development based on expression profile.

When predicting for breast cancer, a gene by which it can be determined whether or not hormone receptors, particularly estrogen receptors, are expressed is preferable, in that hormone receptor expression plays an important role in determining the nature of the breast cancer. When predicting for colon cancer, it is preferable to classify genes into a statistically significant number of clusters by performing cluster analysis according to the expression pattern of the genes, and select a group of genes belonging to a cluster relating to metastasis and/or prognostic factors. Clusters relating to metastasis and/or prognostic factors can be selected by performing principal component analysis or hierarchical cluster analysis on each of the above-classified clusters for their expression patterns, classifying samples according to expression patterns, and then examining the relationship between this classification and the prognosis and/or prognostic factors. In this case, therefore, all of the genes are subjected to multivariate analysis in advance in order to select the genes useful for further multivariate analysis.

In the present invention, when classifying cancer specimens using the genes by which the presence or absence of expression of the above-mentioned estrogen receptors can be determined, the expression can be linked to metastasis or recurrence based on the different degrees of malignancy of the specimens. “Genes by which the presence or absence of estrogen receptor can be determined” refer to those genes by which the specimens can be classified into an estrogen receptor-positive group and an estrogen receptor-negative group, when determining the expression level of a gene isolated from a specimen, and performing multivariate analysis as described later. Specifically, a plurality of specimens (normal and cancerous tissues) are collected and reacted with an antibody against estrogen receptor to determine whether the specimens are positive or negative for the receptor. Based on the results of this determination and on those of the expression of the above genes, cluster analysis is performed so that genes are selected by which the specimens can be classified into estrogen receptor-positive and negative groups.

In the present invention, cancer specimens can be related to metastasis or recurrence based on differences in the degree of malignancy by classifying the specimens by cluster analysis using a gene group(s) belonging to the cluster relating to metastasis and/or prognostic factors.
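Classification of specimens by cluster analysis may be illustrated with the following minimal sketch of agglomerative (hierarchical) clustering. The choice of Euclidean distance and average linkage is an assumption made for illustration; the specification does not fix a distance measure or linkage rule. The expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def cluster(profiles, n_groups=2):
    """Merge the two closest clusters repeatedly until n_groups remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_groups:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], profiles
            ),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return [sorted(members) for members in clusters]

# Hypothetical expression levels of two selected genes in five specimens.
profiles = [[5.0, 1.0], [6.0, 2.0], [1.0, 6.0], [2.0, 5.0], [5.5, 1.2]]
print(cluster(profiles))
```

Each returned list holds the indices of the specimens assigned to one group, which can then be compared with clinical information such as receptor status, metastasis or recurrence.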

In the selection of the genes, prior to selecting the genes on the basis of the above-mentioned predetermined criteria, the ratio of the variation of gene expression level in cancer specimens to the variation of gene expression level in normal specimens may be calculated, so that genes satisfying predetermined criteria can be selected in advance.

Variation within subgroup (Vg) is expressed by the following equation:
$\begin{matrix} Vg = \sum_{i = 1}^{p} \sum_{j = 1}^{q} {(Xi - \overline{Xj})}^{2} & (I) \end{matrix}$

wherein $\overline{Xj}$ is the average of the gene expression levels in each group, p is the number of genes, q is the number of groups, and Xi is the expression level of a gene. Thus, Vg is the sum of the squares of the differences between each level and the average in the normal or cancer specimen group. The ratio may be suitably changed depending on factors including the type of genes to be analyzed, the number of cases, and the number of genes. However, the ratio is normally from 1.10 to 1.20, preferably not less than 1.18 (e.g. from 1.18 to 1.20).
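The advance selection described above may be sketched as follows. The sketch assumes that, for a single gene, the variation in a specimen group is the sum of squared deviations of the measured levels from the group average, and that a gene is retained when the cancer-to-normal ratio is not less than the threshold. Because sums of squares grow with the number of specimens, comparable group sizes are also assumed. The gene names and values are hypothetical.

```python
def variation(levels):
    """Sum of squared deviations of expression levels from their average."""
    average = sum(levels) / len(levels)
    return sum((x - average) ** 2 for x in levels)

def preselect(expression, threshold=1.20):
    """Keep genes whose cancer/normal variation ratio is at least `threshold`."""
    selected = []
    for gene, (cancer_levels, normal_levels) in expression.items():
        v_normal = variation(normal_levels)
        if v_normal > 0 and variation(cancer_levels) / v_normal >= threshold:
            selected.append(gene)
    return selected

# Hypothetical expression levels: gene -> (cancer specimens, normal specimens).
expression = {
    "gene_a": ([1.0, 5.0, 9.0], [4.0, 5.0, 6.0]),  # ratio 32 / 2 = 16
    "gene_b": ([4.0, 5.0, 6.0], [4.0, 5.0, 6.0]),  # ratio 2 / 2 = 1
}
print(preselect(expression))
```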

In the case of breast cancer, for example, the selection of genes can be performed by applying the principle of analysis of variance to the presence or absence of expression of estrogen receptors. First, by setting the ratio of the variation within the normal specimen subgroup to that within the cancer specimen subgroup at 1.20, for example, 152 genes out of 2412 genes can be selected in advance. Next, for the tissue or cell samples in each case (e.g. blood, removed lesion, biopsy sample), the presence or absence of expression of the estrogen receptor is detected by using an antibody against the estrogen receptor in a conventional manner (ELISA or RIA, for example), and the samples are divided into an estrogen receptor-positive group and an estrogen receptor-negative group. Thereafter, the ratio of the variation across all groups (total variation) to the variation of expression level within each group (variation within subgroup) is calculated. Genes for which this ratio satisfies predetermined criteria are selected.

The total variation (Vt) is expressed by the following equation:
$\begin{matrix} Vt = \sum_{i = 1}^{p} {(Xi - \overline{Xt})}^{2} & (II) \end{matrix}$

wherein Xi and p are as described above, and $\overline{Xt}$ is the average of the gene expression levels across all the samples. Thus, Vt indicates the sum of the squares of the differences between each value of the gene expression level and the total average of the positive and negative groups.

The variation within subgroup (Vg) is as described above, namely it is expressed by the following equation:
$\begin{matrix} Vg = \sum_{i = 1}^{p} \sum_{j = 1}^{q} {(Xi - \overline{Xj})}^{2} & (I) \end{matrix}$

wherein $\overline{Xj}$ is the average of the gene expression levels within each group, q is the number of groups, and Xi and p are as described above. Thus, Vg is the sum of the squares of the differences between the detected level of each sample and the average of the positive or negative group.

The ratio may be suitably changed depending on some factors including the type of genes to be analyzed, the number of cases, and the number of genes. However, the ratio (total variation/variation within subgroup) is normally from 1.10 to 1.20, preferably not less than 1.18 (e.g. from 1.18 to 1.20).
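The ratio of total variation to variation within subgroup may be computed for one gene as in the following sketch. It assumes that the sums in equations (I) and (II) run over the samples measured for that gene, with the samples already divided into an estrogen receptor-positive group and an estrogen receptor-negative group. The function names and expression values are hypothetical.

```python
def average(levels):
    return sum(levels) / len(levels)

def sum_of_squares(levels, center):
    """Sum of squared differences between each level and `center`."""
    return sum((x - center) ** 2 for x in levels)

def total_variation(groups):
    """Vt: squared deviations of all samples from the overall average."""
    all_levels = [x for group in groups for x in group]
    return sum_of_squares(all_levels, average(all_levels))

def within_variation(groups):
    """Vg: squared deviations of each sample from its own group average."""
    return sum(sum_of_squares(group, average(group)) for group in groups)

def variation_ratio(groups):
    """Vt / Vg; large values mean the groups differ strongly for this gene."""
    vg = within_variation(groups)
    if vg == 0:
        raise ValueError("variation within subgroup is zero")
    return total_variation(groups) / vg

# Hypothetical expression levels of one gene in ER+ and ER- samples.
discriminating = [[8.0, 9.0, 10.0], [2.0, 3.0, 4.0]]
uninformative = [[5.0, 6.0, 7.0], [5.0, 6.0, 7.0]]
print(variation_ratio(discriminating))
print(variation_ratio(uninformative))
```

A gene whose level does not differ between the groups gives a ratio close to 1, whereas a gene that separates the groups gives a larger ratio, which is why a lower bound on the ratio acts as a selection criterion.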

In the present invention, when the indicator is such that the specimens are divided into the estrogen receptor-positive (ER+) group and negative (ER−) group, 27 types of genes (gene group I) can be selected as shown in numbers 1 to 27 in the “No.” column of Table 1 below. These genes are genes by which, when subjected to multivariate analysis, the presence or absence of expression of the estrogen receptor can be discriminated.

TABLE 1

| No. | Gene name | A.N. (EST) | A.N. | Contents of gene | SEQ ID NO: |
|---|---|---|---|---|---|
| 1 | GS1176 | AI092005 | S82616 | p6 = cytochrome c oxidase subunit VIc homolog/COSVIc/prostatic carcinoma upregulated gene [human, prostate carcinoma cell line PC3, mRNA Partial, 261 nt]. | 1 |
| 2 | GS1472 | AW249669 | L13850 | Homo sapiens hXBP-1 transcription factor DNA. | 2 |
| 3 | GS1957 | AA622242 | D90427 | Human mRNA for zinc-alpha2-glycoprotein precursor, complete cds. | 3 |
| 4 | GS2307 | AI141674 | X14583 | Human mRNA for immunoglobulin (Ig) lambda-chain. | 4 |
| 5 | GS2471 | AI369429 | | | 5 |
| 6 | GS2632 | AI891146 | Z97630 | Human DNA sequence from clone RP3-466N1 on chromosome 22q12-13: Contains the H1F0 (H1 histone family member 0) gene, 2-amino-3-ketobutyrate CoA ligase (nuclear gene encoding mitochondrial protein), and GALR3 (gal). | 6 |
| 7 | GS2828 | AI923193 | X51755 | Human lambda-immunoglobulin constant region complex (germline). | 7 |
| 8 | GS6601 | AI041822 | X60111 | H. sapiens mRNA for MRP-1. | 8 |
| 9 | GS6770 | AW052188 | | | 9 |
| 10 | GS6784 | AI570120 | U92544 | Human hepatocellular carcinoma associated protein (JCL-1) mRNA, complete cds. | 10 |
| 11 | GS690 | AI688954 | S70290 | Glutamine synthetase (human, tumorous liver, mRNA Partial) | 11 |
| 12 | GS7012 | | | | 12 |
| 13 | GS7116 | AA885475 | AF161540 | Homo sapiens HSPC055 mRNA, complete cds. | 13 |
| 14 | GS7264 | AI128061 | AF089747 | Homo sapiens alpha-1-antichymotrypsin precursor, mRNA, partial cds. | 14 |
| 15 | GS7288 | AI912299 | AB021868 | Homo sapiens PIAS3 mRNA for protein inhibitor of activated STAT3, complete cds. | 15 |
| 16 | GS7583 | AI800822 | AF000231 | Homo sapiens rab11a GTPase mRNA, complete cds. | 16 |
| 17 | GS7715 | AI126560 | AB011159 | Homo sapiens mRNA for KIAA0587 protein, complete cds. | 17 |
| 18 | GS6711 | AW005373 | | | 18 |
| 19 | GS7632 | AL119034 | AF111849 | Homo sapiens PRO0530 mRNA, complete cds. | 19 |
| 20 | GS7435 | AI092753 | AF132948 | Homo sapiens CGI-14 protein mRNA, complete cds. | 20 |
| 21 | GS7001 | AW051878 | AF037335 | Homo sapiens carbonic anhydrase precursor (CA 12) mRNA, complete cds. | 21 |
| 22 | GS7071 | | | | 22 |
| 23 | GS6703 | | AL359588 | DKFZp762B226 (nameless complete cDNA) | 23 |
| 24 | GS4353 | | AF120265 | Human tetraspan NET-6 mRNA, complete cds. | 24 |
| 25 | GS3006 | | M26326 | Human keratin 18 mRNA, complete cds. | 25 |
| 26 | GS3295 | | L15203 | Human secretory protein (P1.B) mRNA | 26 |
| 27 | GS2189 | | XM_002311 | Homo sapiens 126 bp sequence | 27 |

A.N.: Accession number.

In multivariate analysis, more than one desired gene out of gene group I can be selected in any combination. For example, genes indicated by Nos. 1-21 in the “No.” column of Table 1 should preferably be used. It is also possible to select one or more genes other than gene group I but for which expression levels have been measured and combine with one or more genes of gene group I. The genes other than those of gene group I may have characteristics which are totally different from or similar to those of the genes of gene group I. For example, genes encoding immunoglobulin or other genes may be selected.

In the case of colon cancer, for example, the genes can be selected by carrying out cluster analysis based on gene expression patterns and thus classifying the genes into a statistically significant number of clusters according to the gene expression patterns, thereby selecting a gene group belonging to a cluster preferable for multivariate analysis. In the present invention the cluster preferable for multivariate analysis is a cluster relating to, e.g., metastasis and/or prognostic factors. The cluster relating to the metastasis and/or prognostic factors can be selected by classifying the samples (specimens) of each of the above-classified clusters according to expression patterns by principal component analysis or hierarchical cluster analysis, and then using the relationship between this classification and the prognosis and/or prognostic factors as a reference or indicator.
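The relationship between a sample classification and a clinical factor such as metastasis may be examined, for example, with a 2x2 contingency table. The sketch below uses Fisher's exact test, which is an assumption made for illustration; the specification does not name a particular statistical test. The counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1, n = a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x first-column counts in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )

# Hypothetical counts: rows are the two sample groups obtained by classification,
# columns are specimens with and without metastasis.
p_value = fisher_exact_two_sided(8, 2, 1, 9)
print(p_value)
```

A small p-value for a given gene cluster would suggest that the classification obtained from that cluster is associated with the clinical factor, which is one way such a cluster could be singled out.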

In the present invention, the present inventors have found that 1536 genes relating to colon cancer could be classified by cluster analysis into 44 clusters, of which the cluster relating to metastasis was cluster No. 14, and the clusters relating to the prognostic factors were clusters Nos. 42-44. As the genes belonging to cluster No. 14, the 126 genes (referred to as gene group II) indicated by Nos. 28-153 in the “No.” column in Table 2 below can be selected and used for multivariate analysis. As the genes belonging to cluster Nos. 42-44, the 136 genes (referred to as gene group III) indicated by Nos. 154-289 in the “No.” column in Table 3 below can be selected and used for multivariate analysis. These genes are related to metastasis or prognosis when multivariate analysis is performed.

TABLE 2

| No. | Gene name | A.N. | Contents of gene |
|---|---|---|---|
| 28 | GS2695 | Y14737 | Homo sapiens mRNA for immunoglobulin lambda heavy chain. |
| 29 | GS169 | M23905 | Human MHC class II lymphocyte antigen (DPw4-alpha-1) gene. |
| 30 | GS846 | M13560 | Human Ia-associated invariant gamma-chain gene, exon 8, clones lambda-y(1, 2, 3). |
| 31 | GS2307 | X14583 | Human mRNA for Ig lambda-chain. |
| 32 | GS3628 | S49006 | Ig kappa {clone cYF.kappa} [human, mRNA Partial, 1209 nt]. |
| 33 | GS3813 | L12387 | Human sorcin (SRI) mRNA, complete cds. |
| 34 | GS3616 | AF072097 | Homo sapiens beta-2 microglobulin gene, complete cds. |
| 35 | GS1954 | Z11793 | H. sapiens mRNA for selenoprotein P. |
| 36 | GS3642 | L26165 | Human DNA synthesis inhibitor mRNA, complete cds. |
| 37 | GS2660 | X04412 | Human mRNA for plasma gelsolin. |
| 38 | GS690 | AL161952 | Homo sapiens mRNA; cDNA DKFZp434M0813 (from clone DKFZp434M0813); partial cds. |
| 39 | GS1864 | M36501 | Human alpha-2-macroglobulin mRNA, 3′ end. |
| 40 | GS4210 | X67698 | H. sapiens tissue specific mRNA. |
| 41 | GS1119 | AF207829 | Homo sapiens SCAN-related protein RAZ1 (RAZ1) mRNA, partial cds. |
| 42 | GS3516 | D14043 | Human mRNA for MGC-24, complete cds. |
| 43 | GS1329 | V01512 | Human cellular oncogene c-fos (complete sequence). |
| 44 | GS2960 | AF226869 | Homo sapiens RB-associated KRAB repressor (RBAK) mRNA, complete cds. |
| 45 | GS5569 | U14750 | Human connective tissue growth factor mRNA, partial cds. |
| 46 | GS1834 | X62320 | H. sapiens mRNA for epithelin 1 and 2. |
| 47 | L09159 | L09159 | Homo sapiens RHOA proto-oncogene multi-drug-resistance protein mRNA, 3′ end. |
| 48 | X05026 | X05026 | Human rho mRNA (clone 12). |
| 49 | GS3614 | AF021792 | Homo sapiens Bcl-X/Bcl-2 binding protein (BAD) mRNA, partial cds. |
| 50 | GS2704 | AA025421 | H. sapiens mRNA for HES1 protein |
| 51 | GS3295 | L15203 | Human secretory protein (P1.B) mRNA, complete cds. |
| 52 | GS4523 | X73685 | C. aethiops hsp70 mRNA. |
| 53 | GS2707 | AF054183 | Homo sapiens GTP binding protein mRNA, complete cds. |
| 54 | GS194 | AF217517 | Homo sapiens uncharacterized bone marrow protein BM041 mRNA, complete cds. |
| 55 | GS2723 | M35252 | Human CO-029. |
| 56 | GS3240 | M58485 | Human lysosomal membrane glycoprotein CD63 mRNA. |
| 57 | GS3529 | X04297 | Human mRNA for Na, K-ATPase alpha-subunit. |
| 58 | GS1054 | X75861 | H. sapiens TEGT gene. |
| 59 | GS6870 | D16562 | Human mRNA for ATP synthase gamma-subunit (L-type), complete cds. |
| 60 | GS1008 | M11353 | Human H3.3 histone class C mRNA, complete cds. |
| 61 | GS3048 | X61156 | H. sapiens mRNA for laminin-binding protein. |
| 62 | GS1458 | X54304 | Human mRNA for myosin regulatory light chain. |
| 63 | GS1756 | AK000841 | Homo sapiens cDNA FLJ20834 fis, clone ADKA02953, highly similar to AF115384 Homo sapiens LR8. |
| 64 | GS821 | M27891 | Human cystatin C (CST3) gene, exon 3. |
| 65 | GS3245 | M84643 | Macaca mulatta thioredoxin mRNA, complete cds. |
| 66 | GS2260 | U27143 | Human protein kinase C inhibitor-I cDNA, complete cds. |
| 67 | GS2991 | L07287 | Human ribosomal protein L26 (RPL26) gene exon 2, complete cds. |
| 68 | GS2789 | X04803 | Homo sapiens ubiquitin gene. |
| 69 | GS137 | L19739 | Homo sapiens metallopanstimulin (MPS1) mRNA, complete cds. |
| 70 | GS3565 | L25085 | Human Sec61-complex beta-subunit mRNA, complete cds. |
| 71 | AF077034 | AF077034 | Homo sapiens HSPC010 mRNA, complete cds. |
| 72 | GS3819 | AF117616 | Homo sapiens SOUL protein (SOUL) mRNA, complete cds. |
| 73 | GS3424 | AK000462 | Homo sapiens cDNA FLJ20455 fis, clone KAT05813. |
| 74 | GS4401 | U47414 | Human cyclin G2 mRNA, complete cds. |
| 75 | GS4568 | AL049963 | Homo sapiens mRNA; cDNA DKFZp564A132 (from clone DKFZp564A132). |
| 76 | GS6584 | J04611 | Human lupus p70 (Ku) autoantigen protein mRNA, complete cds. |
| 77 | GS4090 | U24704 | Human antisecretory factor-1 mRNA, complete cds. |
| 78 | GS2932 | X52317 | Human mRNA for histone H2A.Z. |
| 79 | GS2365 | Z49835 | H. sapiens mRNA for protein disulfide isomerase. |
| 80 | GS2495 | D00422 | Human sphingolipid activator proteins, mRNA, complete cds. |
| 81 | GS3021 | X01060 | Human mRNA for transferrin receptor. |
| 82 | GS3823 | X05344 | Human mRNA for cathepsin D from oestrogen responsive breast cancer cells. |
| 83 | GS983 | M30685 | Pan Troglodytes MHC class I protein mRNA (MHCPATRF1). |
| 84 | GS726 | X58536 | Human mRNA for HLA class I locus C heavy chain. |
| 85 | GS3409 | NM_001101 | Homo sapiens actin, beta (ACTB), mRNA. |
| 86 | GS7358 | M74817 | Human tropomyosin-1 (TM-beta) mRNA, complete cds. |
| 87 | GS3542 | D49400 | Homo sapiens mRNA for vacuolar ATPase, complete cds. |
| 88 | GS2965 | U90654 | Human zinc-finger domain-containing protein mRNA, partial cds. |
| 89 | GS1990 | X04481 | Human mRNA for complement component C2. |
| 90 | GS3222 | U44954 | Human NMDA receptor glutamate-binding chain (hnrgw) mRNA, partial cds. |
| 91 | GS697 | Z37166 | H. sapiens BAT1 mRNA for nuclear RNA helicase (DEAD family). |
| 92 | GS1353 | D50372 | Homo sapiens mRNA for myosin regulatory light chain, complete cds. |
| 93 | GS3621 | AC004938 | Homo sapiens clone DJ0971C03, complete sequence. |
| 94 | GS2907 | AA633993 | Cell division cycle 10 (homologous to CDC10 of S. cerevisiae) |
| 95 | GS3383 | AK000070 | Homo sapiens cDNA FLJ20063 fis, clone COL01524. |
| 96 | GS3043 | M32306 | Human epithelial glycoprotein (EGP) mRNA, complete cds. |
| 97 | GS6968 | AF044221 | Homo sapiens HCG-1 protein (HCG-1) mRNA, complete cds. |
| 98 | GS2998 | D87667 | Human brain mRNA homologous to 3′UTR of human CD24 gene, partial sequence. |
| 99 | GS2752 | M29540 | Human carcinoembryonic antigen mRNA (CEA), complete cds. |
| 100 | GS3752 | AB018270 | Homo sapiens mRNA for KIAA0727 protein, partial cds. |
| 101 | GS3223 | AL133580 | Homo sapiens mRNA; cDNA DKFZp434N2072 (from clone DKFZp434N2072). |
| 102 | GS1264 | M21575 | Human cytochrome c oxidase COX subunit IV (COX IV) mRNA, complete cds. |
| 103 | GS201 | AB009010 | Homo sapiens mRNA for polyubiquitin UbC, complete cds. |
| 104 | GS3904 | Z35415 | H. sapiens gene encoding E-cadherin, exon 16. |
| 105 | GS3390 | U29091 | Human selenium-binding protein (hSBP) mRNA, complete cds. |
| 106 | GS2252 | X01683 | Human mRNA for alpha 1-antitrypsin. |
| 107 | GS3412 | J03544 | Human brain glycogen phosphorylase mRNA, complete cds. |
| 108 | GS2952 | AF007194 | Homo sapiens mucin (MUC3) mRNA, partial cds. |
| 109 | GS3116 | X91863 | H. sapiens Gpx2 gene. |
| 110 | GS3779 | M81600 | Human NAD(P)H: quinone oxireductase gene, exon 6. |
| 111 | GS1655 | J03746 | Human glutathione S-transferase mRNA, complete cds. |
| 112 | GS145 | M77234 | Human ribosomal protein S3a mRNA, complete cds. |
| 113 | GS2032 | AF104238 | Homo sapiens ADP-ribosylation factor 4 (ARF4) gene, exon 6 and complete cds. |
| 114 | GS133 | J04617 | Human elongation factor EF-1-alpha gene, complete cds. |
| 115 | GS7058 | U35048 | Human TSC-22 protein mRNA, complete cds. |
| 116 | GS2547 | X96752 | H. sapiens mRNA for L-3-hydroxyacyl-CoA dehydrogenase. |
| 117 | GS4723 | AF047470 | Homo sapiens malate dehydrogenase precursor (MDH) mRNA, nuclear gene encoding mitochondrial protein, complete cds. |
| 118 | GS243 | AB021288 | Homo sapiens mRNA for beta 2-microglobulin, complete cds. |
| 119 | GS3682 | AF151802 | Homo sapiens CGI-44 protein mRNA, complete cds. |
| 120 | GS2791 | D13629 | Human mRNA for KIAA0004 gene, complete cds. |
| 121 | GS7410 | AF075010 | Homo sapiens full length insert cDNA YI03D03. |
| 122 | GS1208 | U72512 | Human B-cell receptor associated protein (hBAP) alternatively spliced mRNA, partial 3′UTR. |
| 123 | GS7407 | AA485677 | Human zyxin related protein ZRP-1 mRNA, complete cds |
| 124 | GS3119 | NM_003641 | Homo sapiens interferon induced transmembrane protein 1 (9-27) (IFITM1), mRNA. |
| 125 | GS988 | L08246 | Human myeloid cell differentiation protein (MCL1) mRNA. |
| 126 | U15173 | U15173 | Homo sapiens BCL2/adenovirus E1B 19 kD-interacting protein 2 mRNA, complete cds. |
| 127 | GS2263 | AB002382 | Human mRNA for KIAA0384 gene, complete cds. |
| 128 | GS2848 | AC004258 | Homo sapiens chromosome 19, cosmid R33114, complete sequence. |
| 129 | GS2535 | AL096818 | Human DNA sequence from clone RP1-262C15 on chromosome 6q16.1-21. Contains the 3′ end of a novel gene, ESTs, STSs and GSSs, complete sequence. |
| 130 | GS3269 | AF113016 | Homo sapiens PRO1073 mRNA, complete cds. |
| 131 | L11910 | L11910 | Human retinoblastoma susceptibility gene exons 1-27, complete cds. |
| 132 | GS3644 | Z85986 | Human DNA sequence from clone 108K11 on chromosome 6p21 Contains SRP20 (SR protein family member), Ndr protein kinase gene similar to yeast suppressor protein SRP40, EST and GSS, complete sequence. |
| 133 | GS2973 | V00662 | H. sapiens mitochondrial genome. |
| 134 | GS2726 | AJ002190 | Homo sapiens cDNA for dihydroxyacetonephosphate acyltransferase (DAP-AT). |
| 135 | GS906 | AF182645 | Homo sapiens chondrosarcoma-associated protein 2 (CSA2) mRNA, complete cds. |
| 136 | GS3950 | L06070 | Human squalene synthetase (ERG9) mRNA, complete cds. |
| 137 | GS2524 | X71129 | H. sapiens mRNA for electron transfer flavoprotein beta subunit. |
| 138 | GS1768 | J04058 | Human electron transfer flavoprotein alpha-subunit mRNA, complete cds. |
| 139 | GS5905 | D13866 | Human mRNA for alpha-catenin, complete cds. |
| 140 | GS3741 | L08666 | Homo sapiens porin (por) mRNA, complete cds and truncated cds. |
| 141 | GS3426 | D21235 | Human mRNA for HHR23A protein, complete cds. |
| 142 | GS2512 | BC005402 | Homo sapiens, clone MGC:12543, mRNA, complete cds. |
| 143 | GS1662 | Y10211 | H. sapiens LAG-3 gene, promoter region. |
| 144 | GS3873 | D14662 | Human mRNA for KIAA0106 gene, complete cds. |
| 145 | GS261 | AF047439 | Homo sapiens unknown mRNA, complete cds. |
| 146 | GS3611 | NM_006793 | Homo sapiens peroxiredoxin 3 (PRDX3), nuclear gene encoding mitochondrial protein, mRNA. |
| 147 | GS273 | X59066 | Human mRNA for mitochondrial ATP synthase (F1-ATPase) alpha subunit. |
| 148 | GS242 | M81457 | Human calpactin 1 light chain mRNA, complete cds. |
| 149 | GS410 | M11146 | Human ferritin H chain mRNA, complete cds. |
| 150 | GS599 | M77233 | Human ribosomal protein S7 mRNA, 3′ end. |
| 151 | GS1042 | X87838 | H. sapiens mRNA for beta-catenin. |
| 152 | GS308 | Y00345 | Human mRNA for poly A binding protein. |
| 153 | GS3608 | X63753 | H. sapiens son-a mRNA. |

A.N.: Accession Number

TABLE 3

| No. | Gene name | A.N. | Contents of gene |
|---|---|---|---|
| 154 | GS2976 | X87342 | H. sapiens mRNA for human giant larvae homolog. |
| 155 | GS4595 | BC002507 | Homo sapiens, WD repeat domain 13, clone MGC:1020, mRNA, complete cds. |
| 156 | GS3264 | X06347 | Human mRNA for U1 small nuclear RNP-specific A protein. |
| 157 | GS2141 | L35013 | Human spliceosomal protein (SAP 49) gene, complete cds. |
| 158 | GS3995 | NM_003379 | Homo sapiens villin 2 (ezrin) (VIL2), mRNA. |
| 159 | GS4409 | AK001523 | Homo sapiens cDNA FLJ10661 fis, clone NT2RP2006106. |
| 160 | GS4687 | NM_019606 | Homo sapiens hypothetical protein FLJ20257 (FLJ20257), mRNA. |
| 161 | GS2984 | AJ000342 | Homo sapiens mRNA for DMBT1 6 kb transcript variant 1 (DMBT1/6 kb.1). |
| 162 | GS2891 | X74215 | H. sapiens mRNA for Lon protease-like protein. |
| 163 | GS4065 | AL050372 | Homo sapiens mRNA; cDNA DKFZp434A091 (from clone DKFZp434A091). |
| 164 | GS4782 | NM_004911 | Homo sapiens protein disulfide isomerase related protein (calcium-binding protein, intestinal-related) (ERP70), mRNA. |
| 165 | GS4735 | AF118224 | Homo sapiens matriptase mRNA, complete cds. |
| 166 | GS724 | AF077051 | Homo sapiens PTD001 mRNA, complete cds. |
| 167 | GS4072 | AF151806 | Homo sapiens CGI-48 protein mRNA, complete cds. |
| 168 | GS4682 | AF035959 | Homo sapiens type-2 phosphatidic acid phosphatase-gamma (PAP2-g) mRNA, complete cds. |
| 169 | GS3068 | M15205 | Human thymidine kinase gene, complete cds, with clustered Alu repeats in the introns. |
| 170 | GS2846 | X65614 | H. sapiens mRNA for calcium-binding protein S100P. |
| 171 | GS4185 | AJ011916 | Homo sapiens mRNA for hypothetical protein. |
| 172 | GS3154 | AL121673 | Human DNA sequence from clone RP11-305P22 on chromosome 20 Contains ESTs, STSs, GSSs and 7 CpG islands. Contains three novel genes and a novel gene for a helix-loop-helix DNA binding protein, complete sequence. |
| 173 | GS502 | X63423 | H. sapiens mRNA for delta-subunit of mitochondrial F1F0 ATP-synthase (clone #5). |
| 174 | GS4552 | D45370 | Human apM2 mRNA for GS2374 (unknown product specific to adipose tissue), complete cds. |
| 175 | GS5215 | M30952 | Orangutan 28S ribosomal RNA gene fragment. |
| 176 | GS2425 | AB023165 | Homo sapiens mRNA for KIAA0948 protein, complete cds. |
| 177 | GS4267 | NM_013440 | Homo sapiens paired immunoglobulin-like receptor beta (PILR(BETA)), mRNA. |
| 178 | GS4832 | NM_001571 | Homo sapiens interferon regulatory factor 3 (IRF3), mRNA. |
| 179 | GS3249 | U22055 | Human 100 kDa coactivator mRNA, complete cds. |
| 180 | GS855 | AK001829 | Homo sapiens cDNA FLJ10967 fis, clone PLACE1000798. |
| 181 | GS765 | AJ238097 | Homo sapiens mRNA for Lsm5 protein. |
| 182 | L33075 | L33075 | Homo sapiens ras GTPase-activating-like protein (IQGAP1) mRNA, complete cds. |
| 183 | GS4482 | AF154502 | Homo sapiens quiescent cell proline dipeptidase (QPP) mRNA, complete cds. |
| 184 | GS4008 | NM_012426 | Homo sapiens splicing factor 3b, subunit 3, 130 kD (SF3B3), mRNA. |
| 185 | GS1540 | U19721 | Human peroxisomal targeting signal receptor 1 (PXR1) mRNA, complete cds. |
| 186 | GS2869 | AL023653 | Human DNA sequence from clone 753P9 on chromosome Xq25-26.1. Contains the gene coding for Aminopeptidase P (EC 3.4.11.9, XAA-Pro/X-Pro/Proline/Aminoacylproline Aminopeptidase) and a novel gene. Contains ESTs, STSs, GSSs and a gaaa repeat polymorphism, complete sequence. |
| 187 | GS4498 | AF151105 | Homo sapiens 3′-5′ exonuclease TREX1 mRNA, complete cds. |
| 188 | GS4263 | BC004925 | Homo sapiens, Similar to G protein-coupled receptor, family C, group 5, member C, clone MGC:10304, mRNA, complete cds. |
| 189 | GS4198 | AK000453 | Homo sapiens cDNA FLJ20446 fis, clone KAT05231. |
| 190 | GS3749 | M98326 | Human transfer valyl-tRNA synthetase mRNA, 3′ end of cds. |
| 191 | D38122 | D38122 | Human mRNA for Fas ligand, complete cds. |
| 192 | R76314 | R76314 | Ras homolog gene family, member G (rho G) |
| 193 | GS2718 | U31201 | Human laminin gamma2 chain gene (LAMC2), exon 23 and flanking sequences, and complete cds. |
| 194 | GS3664 | L16862 | Homo sapiens G protein-coupled receptor kinase (GRK6) mRNA, complete cds. |
| 195 | GS4718 | Z14978 | H. sapiens mRNA for actin-related protein. |
| 196 | GS3193 | AL022240 | Human DNA sequence from clone 328E19 on chromosome 1q12-21.2. Contains a cyclophilin-like gene, a novel gene, ESTs, GSSs and STS, complete sequence. |
| 197 | GS3533 | X05231 | Human mRNA for collagenase (E.C. 3.4.24). |
| 198 | GS4112 | AF054178 | Homo sapiens CI-B14.5a homolog mRNA, complete cds. |
| 199 | GS4559 | NM_002085 | Homo sapiens glutathione peroxidase 4 (phospholipid hydroperoxidase) (GPX4), mRNA. |
| 200 | GS3260 | D86966 | Human mRNA for KIAA0211 gene, complete cds. |
| 201 | GS3924 | AB002312 | Human mRNA for KIAA0314 gene, partial cds. |
| 202 | GS2867 | D84471 | Homo sapiens mRNA for phenylalanyl tRNA synthetase beta subunit, complete cds. |
| 203 | GS3014 | AL096737 | Homo sapiens mRNA; cDNA DKFZp434F152 (from clone DKFZp434F152). |
| 204 | GS4183 | AF068007 | Homo sapiens cell cycle-regulated factor p78 mRNA, complete cds. |
| 205 | GS4515 | AK000154 | Homo sapiens cDNA FLJ20147 fis, clone COL07954. |
| 206 | GS779 | AF069378 | Homo sapiens insulin-like growth factor II receptor (IGF2R) gene, exon 48 and partial cds. |
| 207 | GS4438 | NM_018475 | Homo sapiens TPA regulated locus (TPARL), mRNA. |
| 208 | GS4407 | AC005361 | Homo sapiens chromosome 16, cosmid clone 352F10 (LANL), complete sequence. |
| 209 | GS4452 | U71274 | Human mutant factor XII gene, exon 14 and partial cds. |
| 210 | GS3234 | X65024 | H. sapiens mRNA for xeroderma pigmentosum group C complementing factor (XP-C). |
| 211 | U37100 | U37100 | Homo sapiens aldose reductase-like peptide mRNA, complete cds. |
| 212 | GS3829 | AF042346 | Homo sapiens putative phenylalanyl-tRNA synthetase beta-subunit mRNA, complete cds. |
| 213 | GS1393 | AF094516 | Homo sapiens E1-like protein mRNA, complete cds. |
| 214 | GS4048 | AL110243 | Homo sapiens mRNA; cDNA DKFZp564B0482 (from clone DKFZp564B0482); complete cds. |
| 215 | GS3944 | AB003184 | Homo sapiens mRNA for ISLR, complete cds. |
| 216 | GS3235 | U31556 | Human transcription factor E2F-5 mRNA, complete cds. |
| 217 | GS3408 | Z18538 | H. sapiens encoding skin-derived antileukoproteinase. |
| 218 | GS3248 | NM_012262 | Homo sapiens heparan sulfate 2-O-sulfotransferase (HS2ST1), mRNA. |
| 219 | GS4420 | AF060567 | Homo sapiens sushi-repeat protein (SRPUL) mRNA, complete cds. |
| 220 | GS3004 | Y12777 | Homo sapiens mRNA for acyl-CoA synthetase-like protein. |
| 221 | GS3310 | AB032903 | Homo sapiens GMPR2 mRNA for guanosine monophosphate reductase isolog, complete cds. |
| 222 | GS1333 | AK024628 | Homo sapiens cDNA: FLJ20975 fis, clone ADSU01705. |
| 223 | GS2948 | AF151908 | Homo sapiens CGI-150 protein mRNA, complete cds. |
| 224 | GS2124 | AB011128 | Homo sapiens mRNA for KIAA0556 protein, partial cds. |
| 225 | GS3751 | X17567 | H. sapiens RNA for snRNP protein B. |
| 226 | GS4173 | NM_004719 | Homo sapiens splicing factor, arginine/serine-rich 2, interacting protein (SFRS2IP), mRNA. |
| 227 | GS2765 | AF075589 | Homo sapiens MDA-MB-231 peripheral-type benzodiazepine receptor (PBR) mRNA, partial cds. |
| 228 | GS4811 | AL157435 | Homo sapiens mRNA; cDNA DKFZp434O0510 (from clone DKFZp434O0510). |
| 229 | GS3135 | M59371 | Human protein tyrosine kinase mRNA, complete cds. |
| 230 | GS4162 | X92762 | H. sapiens mRNA for tafazzins protein. |
| 231 | GS2744 | J02947 | Human extracellular-superoxide dismutase (SOD3) mRNA, complete cds. |
| 232 | GS4893 | NM_019027 | Homo sapiens hypothetical protein (FLJ20273), mRNA. |
| 233 | AI341099 | AI341099 | Orf1 5′ to PD-ECGF/TP . . . orf2 5′ to PD-ECGF/TP [human, epidermoid carcinoma cell line A431, mRNA, 3 genes, 1718 nt] |
| 234 | GS1494 | AL133034 | Homo sapiens mRNA; cDNA DKFZp727K171 (from clone DKFZp727K171); partial cds. |
| 235 | GS3474 | D83174 | Human mRNA for collagen binding protein 2, complete cds. |
| 236 | GS2928 | NM_015140 | Homo sapiens KIAA0153 protein (KIAA0153), mRNA. |
| 237 | GS4106 | AB020628 | Homo sapiens mRNA for KIAA0821 protein, complete cds. |
| 238 | GS4000 | AF055377 | Homo sapiens long form transcription factor C-MAF (c-maf) mRNA, complete cds. |
| 239 | GS3286 | Y07604 | H. sapiens mRNA for nucleoside-diphosphate kinase. |
| 240 | L11701 | L11701 | Human phospholipase D mRNA, complete cds. |
| 241 | GS3170 | L35240 | Human enigma gene, complete cds. |
| 242 | GS2892 | NM_004368 | Homo sapiens calponin 2 (CNN2), mRNA. |
| 243 | GS4015 | NM_005433 | Homo sapiens v-yes-1 Yamaguchi sarcoma viral oncogene homolog 1 (YES1), mRNA. |
| 244 | GS3588 | AF131848 | Homo sapiens clone 24922 mRNA sequence, complete cds. |
| 245 | GS4780 | AD001530 | Homo sapiens XAP-5 mRNA, complete cds. |
| 246 | GS4941 | NM_016380 | Homo sapiens differentiation-related protein dif13 (LOC51212), mRNA. |
| 247 | GS4945 | NM_016343 | Homo sapiens centromere protein F (350/400 kD, mitosin) (CENPF), mRNA. |
| 248 | GS4163 | AC007565 | Homo sapiens chromosome 19, cosmid R27656, complete sequence. |
| 249 | GS3387 | NM_013317 | Homo sapiens hT1a-1 (hT1a-1), mRNA. |
| 250 | GS3386 | NM_003337 | Homo sapiens ubiquitin-conjugating enzyme E2B (RAD6 homolog) (UBE2B), mRNA. |
| 251 | GS4946 | NM_002439 | Homo sapiens mutS (E. coli) homolog 3 (MSH3), mRNA. |
| 252 | GS3019 | NM_003348 | Homo sapiens ubiquitin-conjugating enzyme E2N (homologous to yeast UBC13) (UBE2N), mRNA. |
| 253 | GS4022 | NM_002433 | Homo sapiens myelin oligodendrocyte glycoprotein (MOG), mRNA. |
| 254 | GS4947 | NM_018520 | Homo sapiens hypothetical protein PRO2268 (PRO2268), mRNA. |
| 255 | GS1341 | BC001002 | Homo sapiens, clone IMAGE:3447696, mRNA, partial cds. |
| 256 | GS4512 | NM_005768 | Homo sapiens putative protein similar to nessy (Drosophila) (C3F), mRNA. |
| 257 | GS4501 | AF261689 | Homo sapiens DNA polymerase epsilon p17 subunit gene, complete cds. |
| 258 | GS6969 | AL022316 | Human DNA sequence from clone CTA-126B4 on chromosome 22q13.2-13.31 Contains two or three novel genes, ESTs, STSs, GSSs and a CpG Island, complete sequence. |
| 259 | GS6493 | NM_014925 | Homo sapiens KIAA1002 protein (KIAA1002), mRNA. |
| 260 | GS715 | NM_020221 | Homo sapiens hypothetical protein DKFZp547I224 (DKFZp547I224), mRNA. |
| 261 | GS3002 | U50871 | Human familial Alzheimer's disease (STM2) gene, complete cds. |
| 262 | GS1102 | AL020995 | Human DNA sequence from clone RP1-117O3 on chromosome 1p33-34.3, complete sequence. |
| 263 | GS5239 | U93574 | Human L1 element L1.39 p40 and putative p150 genes, complete cds. |
| 264 | M15990 | M15990 | Human c-yes-1 mRNA. |
| 265 | GS7322 | AF119386 | Homo sapiens blood plasma glutamate carboxypeptidase precursor (PGCP) mRNA, complete cds. |
| 266 | GS3683 | AK026017 | Homo sapiens cDNA: FLJ22364 fis, clone HRC06575. |
| 267 | GS4288 | NM_020983 | Homo sapiens adenylate cyclase 6 (ADCY6), transcript variant 2, mRNA. |
| 268 | GS2715 | NM_016442 | Homo sapiens type 1 tumor necrosis factor receptor shedding aminopeptidase regulator (ARTS-1), mRNA. |
| 269 | GS4364 | NM_002339 | Homo sapiens lymphocyte-specific protein 1 (LSP1), mRNA. |
| 270 | GS3138 | M76378 | Human cysteine-rich protein (CRP) gene, exons 5 and 6. |
| 271 | GS3607 | U76713 | Human apobec-1 binding protein 1 mRNA, complete cds. |
| 272 | GS4976 | NM_014133 | Homo sapiens PRO0618 protein (PRO0618), mRNA. |
| 273 | GS964 | AL136622 | Homo sapiens mRNA; cDNA DKFZp564B172 (from clone DKFZp564B172); complete cds. |
| 274 | GS4824 | AF150105 | Homo sapiens small zinc finger-like protein (TIM9b) mRNA, complete cds. |
| 275 | GS3217 | NM_000786 | Homo sapiens cytochrome P450, 51 (lanosterol 14-alpha-demethylase) (CYP51), mRNA. |
| 276 | GS3380 | AB018255 | Homo sapiens mRNA for KIAA0712 protein, complete cds. |
| 277 | GS4038 | L43619 | Homo sapiens polycystic kidney disease (PKD1) gene, exons 43-46. |
| 278 | GS4375 | AL035541 | Human DNA sequence from clone 718J7 on chromosome 20q13.31-13.33. Contains part of a gene for a novel protein, the PCK1 gene for soluble phosphoenolpyruvate carboxykinase 1, part of a novel gene similar to mouse DLM-1 (tumour stroma and activated macrophage protein), the 3′ end of the TMEPAI gene encoding an androgen induced 1b transmembrane protein (PMEPA1), two putative novel genes, a CpG island, ESTs, STSs and GSSs, complete sequence. |
| 279 | GS3847 | NM_017964 | Homo sapiens hypothetical protein FLJ20837 (FLJ20837), mRNA. |
| 280 | GS3289 | AL136897 | Homo sapiens mRNA; cDNA DKFZp434E248 (from clone DKFZp434E248); complete cds. |
| 281 | GS4702 | S79219 | metastasis-associated gene [human, highly metastatic lung cell subline Anip[937], mRNA Partial, 978 nt]. |
| 282 | GS4742 | NM_006468 | Homo sapiens polymerase (RNA) III (DNA directed) (62 kD) (RPC62), mRNA. |
| 283 | GS2904 | D38594 | Human MTH1 gene for 8-oxo-dGTPase, exon4, complete cds. |
| 284 | GS4563 | U53347 | Human neutral amino acid transporter B mRNA, complete cds. |
| 285 | GS1178 | U01184 | Human homolog of D. melanogaster flightless-I gene product mRNA, partial cds. |
| 286 | GS4062 | AF183423 | Homo sapiens reticulocabin precursor mRNA, complete cds. |
| 287 | GS4394 | U43923 | Human transcription factor SUPT4H mRNA, complete cds. |
| 288 | GS2956 | U02619 | Human TFIIIC Box B-binding subunit mRNA, complete cds. |
| 289 | D26443 | D26443 | GLUT1(glucose transporter) |

A.N.: Accession Number

In multivariate analysis, more than one desired gene can be selected from gene group II and/or gene group III in any combination. For example, it is preferable to use genes of No. 30, 33, 34, 36-42, 44-48, 54-66, 68, 69, 71, 74, 80, 82, 83, 85, 93, 100-103, 112, 115, 116 and/or 118-121 from Table 2, and/or genes of No. 155, 162, 163, 167-169, 171, 172, 174, 175, 177-180, 188, 190, 193, 198, 211, 222, 242-253, 255-257, 259-261, 263 and/or 265 from Table 3. Further, more than one gene that does not belong to gene group II or III but whose expression level has been measured may be combined with the above gene(s). The genes other than those of gene groups II and/or III may have characteristics which are totally different from or similar to those of the genes of gene groups II and/or III. For example, genes encoding immunoglobulin or other genes may be selected.

3. Multivariate Analysis

The measured expression levels are analyzed by multivariate analysis. This is a statistical technique for analyzing relationships such as mutual dependency and subordination in a great number of statistical variables. Multivariate analysis basically involves p kinds of variables observed for each of n objects, but there is a variety of versions adapted for effective analysis of such multivariate data. Examples include but are not limited to cluster analysis, principal component analysis and discriminant analysis.

(1) Cluster Analysis

Cluster analysis usually refers to a technique by which, in the field of multivariate analysis, a number of objects for observation (samples) are gathered for similarity (dissimilarity) and classified into groups according to a predetermined basis of calculation (evaluation criterion). That is, cluster analysis merely “classifies” a number of observed samples into similar groups (or dissimilar groups).

Cluster analysis includes hierarchical cluster analysis and non-hierarchical cluster analysis. Hierarchical cluster analysis initially views each sample as a single cluster, combines adjacent clusters and eventually combines the clusters into a single group. On the other hand, in non-hierarchical cluster analysis, the number of clusters to be generated is designated in advance, and hierarchical cluster analysis is performed on data which are randomly selected from data at certain proportions, using the cluster number as a target. When the target number of clusters is reached, data which were not selected in the previous steps of analysis are combined into the already established clusters in various forms. Hierarchical cluster analysis allows the similarities of samples to be understood visually in the form of a dendrogram, and is often used in the field of biology. Accordingly, it is preferable to use hierarchical cluster analysis in the present invention.

(1-1) Hierarchical Cluster Analysis

In hierarchical cluster analysis, similar samples (clusters) are combined into an upper-hierarchy cluster. As the measure of similarity, the concept of distance is used. For example, supposing there are data {x_ij} (i=1, 2, . . . , n; j=1, 2, . . . , p) observed for n samples with p kinds of variables, the data {x_ij} are as shown in Table 4:

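TABLE 4

| Sample (i) \ Variable (j) | 1 | 2 | . . . | p |
|---|---|---|---|---|
| 1 | x_11 | x_12 | . . . | x_1p |
| 2 | x_21 | x_22 | . . . | x_2p |
| 3 | x_31 | x_32 | . . . | x_3p |
| . . . | . . . | . . . | . . . | . . . |
| n | x_n1 | x_n2 | . . . | x_np |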

To perform cluster analysis based on the observation data given above, a “distance matrix” is generated, which indicates similarity between samples. The distance is calculated, for example in terms of Euclidian distance, weighted Euclidian distance, normalized Euclidian distance, and Pearson's product moment correlation coefficient.

Euclidian distance is the distance normally used. When an individual X_i is measured by p attributes (variables) and the value of the jth attribute is X_ij, Euclidian distance is expressed by the following equation:
$\begin{matrix} d(X_a, X_b) = \sqrt{\sum_{j = 1}^{p} (X_{aj} - X_{bj})^{2}} & (III) \end{matrix}$

Weighted Euclidian distance is expressed by the following equation:
$\begin{matrix} d(X_a, X_b) = \sqrt{\sum_{j = 1}^{p} k_j (X_{aj} - X_{bj})^{2}} & (IV) \end{matrix}$

Weighted Euclidian distance is used when influences on distance are to be varied depending on the attributes. By reducing a weight kj, the contribution of an attribute j to distance is reduced (low data similarity). By increasing the weight, contribution to distance is increased (high data similarity).

Normalized Euclidian distance is expressed by the following equation:
$\begin{matrix} d(X_a, X_b) = \sqrt{\sum_{j = 1}^{p} \frac{(X_{aj} - X_{bj})^{2}}{S_j^{2}}}, \quad S_j^{2} = \frac{\sum_{i = 1}^{n} (X_{ij} - \overline{X_{mj}})^{2}}{n - 1} & (V) \end{matrix}$

wherein $\overline{X_{mj}}$ is the average of X1j to Xnj. In this equation, all the attributes are normalized to variance=1. This equation is used in order to avoid introducing unintended “weights” due to differences in the units of measure used for the attributes. When calculating distance, since it does not matter where the origin is located, all the attributes are normalized to average=0 and variance=1, and the Euclidian distance is calculated by using the normalized values.
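The three distances of equations (III) to (V) can be sketched as follows. This is a minimal illustration only: the function names and the sample expression values are hypothetical, and the weight in equation (IV) is assumed to multiply the squared difference of each attribute.

```python
import math

def euclidean(xa, xb):
    """Equation (III): plain Euclidian distance between two samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xa, xb)))

def weighted_euclidean(xa, xb, k):
    """Equation (IV): each squared difference is scaled by a weight k[j] (assumed form)."""
    return math.sqrt(sum(kj * (a - b) ** 2 for kj, a, b in zip(k, xa, xb)))

def column_variances(data):
    """Unbiased variance S_j^2 of each attribute j over all n samples."""
    n = len(data)
    p = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [sum((row[j] - means[j]) ** 2 for row in data) / (n - 1) for j in range(p)]

def normalized_euclidean(xa, xb, variances):
    """Equation (V): squared differences are divided by the variance S_j^2."""
    return math.sqrt(sum((a - b) ** 2 / s2 for a, b, s2 in zip(xa, xb, variances)))

# Hypothetical expression levels: 3 samples x 2 genes measured on different scales
data = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
variances = column_variances(data)
d_plain = euclidean(data[0], data[1])
d_norm = normalized_euclidean(data[0], data[1], variances)
```

In this toy data the second attribute dominates the plain distance because of its larger scale, whereas the normalized distance treats both attributes on an equal footing.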

A distance r (Pearson's product moment correlation coefficient) between case 1 (x₁, x₂, . . . , x_i, . . . , x_n) and case 2 (y₁, y₂, . . . , y_i, . . . , y_n) is expressed by the following equation:
$\begin{matrix} r = \frac{\frac{1}{n - 1} \sum_{i = 1}^{n} (X_i - \overline{X}) (Y_i - \overline{Y})}{\sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} (X_i - \overline{X})^{2}} \sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} (Y_i - \overline{Y})^{2}}} & (VI) \end{matrix}$

wherein $\overline{X}$ and $\overline{Y}$ indicate the averages of case 1 and case 2, respectively.
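Equation (VI) may be evaluated as in the following sketch. The function names are hypothetical. The conversion of r into a dissimilarity as 1 − r is a common convention assumed here for illustration; it is not stated in the description above.

```python
import math

def pearson_r(x, y):
    """Pearson's product moment correlation coefficient, following equation (VI)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)

def correlation_distance(x, y):
    """Assumed convention: 0 for perfectly correlated cases, 2 for anti-correlated ones."""
    return 1.0 - pearson_r(x, y)

# Hypothetical expression profiles of two cases over three genes
r_same = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_opposite = pearson_r([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```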

Based on the above concept of distance, the distance between clusters or between a cluster and an individual is calculated and the clusters are merged. Examples of the method of merging are as follows:

Nearest-neighbor method: Of the distances between individuals belonging to different clusters, the minimum value is used as the distance between the clusters. In this method, clusters with shorter distances between the nearest samples are merged as similar clusters.

Furthest-neighbor method: The greatest distance between any two individuals in the different clusters is used as the distance between the clusters. In this method, clusters with shorter distances between the furthest samples are merged as similar clusters.

Centroid method: The distance between barycenters of the respective clusters is used as the distance between the clusters. In this method, clusters whose barycenters are near each other are merged as similar clusters.

Ward method: Clusters are merged such that the sum of the squares of the Euclidian distances within the clusters is minimized.

Average distance: An average value of all the distances between individuals belonging to each cluster is used as the distance between the clusters.

By any of these classification methods, clusters with a “shortest distance” relationship are assumed to be similar to each other and merged to make an upper-hierarchy cluster. Once the clusters in one hierarchy are generated, the distances between the generated clusters are calculated again to generate a new distance matrix, and an additional upper-hierarchy cluster is generated by merging the clusters with the minimum distance. In this way, a dendrogram is eventually generated.
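The repeated merging described above can be sketched as follows for the nearest-neighbor, furthest-neighbor and average distance methods. This is an illustrative sketch under stated assumptions, not the implementation of any software named in this description: the function names, the use of Euclidian distance and the sample data are hypothetical, and the pairwise distances are simply recomputed at each step rather than kept in a distance matrix.

```python
import math

def euclidean(xa, xb):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xa, xb)))

def cluster_distance(c1, c2, samples, method):
    """Distance between two clusters given as lists of sample indices."""
    pairwise = [euclidean(samples[i], samples[j]) for i in c1 for j in c2]
    if method == "nearest":      # nearest-neighbor method
        return min(pairwise)
    if method == "furthest":     # furthest-neighbor method
        return max(pairwise)
    if method == "average":      # average distance
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown method: " + method)

def hierarchical_clustering(samples, method="average"):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[i] for i in range(len(samples))]  # each sample starts as its own cluster
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], samples, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

# Hypothetical data: two tight pairs of samples far apart from each other
samples = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5]]
history = hierarchical_clustering(samples, method="nearest")
```

Each entry of `history` records which two clusters were merged and at what distance, which is the information a dendrogram displays as branch heights.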

In a dendrogram, the samples in a merged cluster at a certain hierarchy have been merged based on a certain similarity relationship. These similar samples can be considered to possess a certain common property, and by analyzing that property it becomes possible to clarify the characteristics of the group of those clusters. For example, when the malignancy is used as an indicator and the samples are viewed in the light of whether they are malignant or not, it is possible to clarify that those cancers belonging to some clusters are malignant and others belonging to other clusters are not.

For example, when, focusing attention on estrogen receptors, certain genes are selected by variance analysis and subjected to cluster analysis, breast cancer samples can be classified into: (i) a group of cases most of which are estrogen receptor-positive; (ii) a group of cases of which most are estrogen receptor-negative; and (iii) a group of cases of which some are estrogen receptor-positive and others are negative. By examining which group a sample to be predicted belongs to, it becomes possible to predict the degree of malignancy, such as whether metastasis or recurrence is likely to occur or not.

Reliability between branches in a dendrogram generated by hierarchical cluster analysis may be calculated by the Bootstrap method, for example. In this method, an empirical probability distribution is assumed that gives a probability of 1/n to each of n samples randomly extracted. Then n random samples are considered (extracted) that allow for overlap from the probability distribution. These randomly re-extracted samples give predicted values which are called bootstrap replicates. The random re-extraction is repeated B times to give B bootstrap replicates, based on which bootstrap estimates of a variance (error) from the original predictions are calculated. The Bootstrap method can be used for evaluating reliability when the normality of probability distribution cannot be assumed or its distribution cannot be fully understood due to complicated statistics. The Bootstrap method is a statistical method well-known to those skilled in the art, and a number of software applications for it are also known. Examples of software useful for the present invention include GeneMaths™ (Applied Maths) and Amos (E-works).
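The resampling step of the Bootstrap method can be sketched as follows. For brevity the sketch estimates the error of a simple statistic (a mean) rather than the reliability of dendrogram branches; the function name, the parameter names and the sample values are hypothetical.

```python
import random
import statistics

def bootstrap_standard_error(values, statistic, B=1000, seed=0):
    """Draw B resamples of size n with replacement and return the spread of the replicates."""
    rng = random.Random(seed)           # fixed seed so the sketch is reproducible
    n = len(values)
    replicates = []
    for _ in range(B):
        # each of the n samples is drawn with probability 1/n, overlap allowed
        resample = [values[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return statistics.stdev(replicates), replicates

# Hypothetical expression levels of one gene in eight specimens
values = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
se, replicates = bootstrap_standard_error(values, statistics.mean, B=500)
```

For a dendrogram, the same resampling would be applied to the samples or genes, and the fraction of replicates in which a given branch reappears would serve as its reliability.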

New cancer specimens can be classified on the basis of the classification obtained by cluster analysis, using multivariate analysis such as cluster analysis or discriminant analysis. Examples of the method using cluster analysis include one by which the data of specimens used for the classification and the data of specimens to be predicted are simultaneously subjected to cluster analysis. In another example, the branchings of the dendrogram are traced backwards for classification. When the criteria are simple, classification can be performed by arithmetical computation.

(1-2) Non-hierarchical Cluster Analysis

Examples of non-hierarchical cluster analysis include a method using a self-organizing map (SOM) and the K-means method.

The method using a self-organizing map classifies cancers at individual nodes arranged in k dimensions. The self-organizing map technique is similar to cluster analysis except that all the cancers are re-classified for each operation. The method by the self-organizing map can be used in the two stages of classification of expression patterns and prediction of cancer, as in hierarchical cluster analysis. Further, by performing SOM in combination with the above-mentioned hierarchical cluster analysis, the order of the samples or clusters in a dendrogram can be determined (Chu, S. et al., Science, 282:699, 1998; Tamayo, P., et al., Proc. Natl. Acad. Sci. USA, 96:2907, 1999).

In the K-means method, k initial cluster centroids are appropriately determined, and all of the data are classified into clusters whose centroids they are nearest to. The barycenters of the resulting new clusters are designated as the cluster centers, and classification ends when all of the new cluster centers are identical to the previous ones. The K-means method has a high calculation efficiency and allows the result of cluster analysis to be reached in a short time.
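The K-means procedure described above can be sketched as follows. The function names, the initial centroids and the sample data are hypothetical, and the handling of an empty cluster (its previous center is kept) is an assumption made only so that the sketch always terminates cleanly.

```python
import math

def euclidean(xa, xb):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xa, xb)))

def k_means(data, centroids, max_iter=100):
    """Assign each datum to its nearest centroid, recompute barycenters, repeat until stable."""
    labels = []
    for _ in range(max_iter):
        # classification step: index of the nearest centroid for every datum
        labels = [min(range(len(centroids)), key=lambda c: euclidean(x, centroids[c]))
                  for x in data]
        # update step: barycenter of each resulting cluster
        new_centroids = []
        for c in range(len(centroids)):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # assumption: keep an empty cluster's center
        if new_centroids == centroids:  # all new centers identical to the previous ones
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data with two obvious groups and k = 2 initial centroids
data = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
labels, centroids = k_means(data, [[0.0, 0.0], [10.0, 10.0]])
```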

The above-mentioned cluster analysis is a statistical technique well-known to a person skilled in the art. A number of software applications for cluster analysis are also known. Examples of such software useful for the present invention include GeneMaths™ (from Applied Maths), SAS/STAT software (from SAS Institute), and Genesight™ Version 2.0 (from Biodiscovery).

(2) Principal Component Analysis

Principal component analysis is a technique for eliminating correlations between variables from multivariate measurements and for describing the properties of the original measurements by lower-dimensional variables. In the present invention, principal component analysis is employed to eliminate “noise” contained in the gene expression information resulting from a variety of causes and to extract only variations in the gene expression. This enables statistically significant results to be obtained from the gene expression information.

For example, consider a principal component analysis in the case where there are three variables of x, y and w. A principal component is expressed by a linear combination (weighted sum) of the variables, thus: z=ax+by+cw. By substituting values of individual objects for (x, y, w), the principal component values can be obtained. Normally, each variable is normalized to a mean of 0 and a standard deviation of 1. The weight in the linear combination is a correlation coefficient between the variable and the principal component (e.g., a is the correlation coefficient for x and z).
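The normalization and the weighted sum z=ax+by+cw can be sketched as follows. The function names, the data and the weights a, b and c are hypothetical values chosen only for illustration; in an actual analysis the weights are obtained from the principal component analysis itself.

```python
import statistics

def standardize(values):
    """Normalize one variable to a mean of 0 and a standard deviation of 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def component_values(x, y, w, a, b, c):
    """Evaluate z = a*x + b*y + c*w for every object after standardizing each variable."""
    xs, ys, ws = standardize(x), standardize(y), standardize(w)
    return [a * xi + b * yi + c * wi for xi, yi, wi in zip(xs, ys, ws)]

# Hypothetical measurements of three variables on three objects
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
w = [3.0, 1.0, 2.0]
z = component_values(x, y, w, a=0.6, b=0.8, c=0.0)  # weights are illustrative only
```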

An example of principal component analysis will be described in detail by referring to Table 4. In this example, principal component analysis is performed on n data groups consisting of p kinds of variables. A first principal component score, a second principal component score, and a third principal component score will be calculated.

As a first step of principal component analysis, a first principal component f is determined such that the loss of the information that characterizes the data is minimized. Specifically, based on the data shown in Table 4, the values of a1, a2, a3, . . . , ap of an eigenvector A=(a1, a2, a3, . . . , ap) of the first principal component f are determined such that the variance of f is maximized. The values of a1, a2, a3, . . . , ap are calculated such that a1²+a2²+a3²+ . . . +ap²=1. The first principal component scores f1 to fn, which indicate the amount of information possessed by individual data, are expressed by the following equations:
$$\begin{aligned} f_1 &= a_1 \cdot x_{11} + a_2 \cdot x_{12} + a_3 \cdot x_{13} \\ f_2 &= a_1 \cdot x_{21} + a_2 \cdot x_{22} + a_3 \cdot x_{23} \\ &\;\;\vdots \\ f_i &= a_1 \cdot x_{i1} + a_2 \cdot x_{i2} + a_3 \cdot x_{i3} \\ &\;\;\vdots \\ f_n &= a_1 \cdot x_{n1} + a_2 \cdot x_{n2} + a_3 \cdot x_{n3} \end{aligned} \qquad \text{(VII)}$$

The more the individual values of fi vary, the more clearly can the characteristics of each data be understood. Therefore, the greatest amount of information can be absorbed by the first principal component f when the variance of f is at a maximum.

Similarly for the second principal component, the values of b1, b2, b3, . . . , bp in an eigenvector B=(b1, b2, b3, . . . , bp) of the second principal component g are calculated such that the loss in the amount of information that cannot be absorbed by the first principal component can be minimized. When the second principal component score for the ith data is gi, gi can be expressed as gi=b1·xi1+b2·xi2+b3·xi3.

Similarly for the third principal component, the values of c1, c2, c3, . . . , and cp in an eigenvector C=(c1, c2, c3, . . . , cp) of the third principal component h are calculated. When the third principal component score for the ith data is hi, hi can be expressed as hi=c1·xi1+c2·xi2+c3·xi3.

Specifically, variance and covariance matrices are obtained from the data in Table 4, and the individual components are calculated from such eigenvalues and eigenvectors that the variance is maximized.
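The calculation from the variance-covariance matrix can be sketched as below. This is an illustrative outline only: the function name `principal_components`, the use of NumPy, and the default of three components are assumptions, and the input `x` stands for an n-by-p data matrix of the kind shown in Table 4.

```python
import numpy as np

def principal_components(x, n_components=3):
    """Return eigenvalues, eigenvectors (as columns) and scores of the leading
    principal components of an n-by-p data matrix."""
    x = np.asarray(x, dtype=float)
    # Normalize each variable to a mean of 0 and a standard deviation of 1.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Variance-covariance matrix of the normalized variables.
    cov = np.cov(z, rowvar=False)
    # eigh handles symmetric matrices; it returns unit-length eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort by decreasing eigenvalue, i.e. by decreasing variance absorbed.
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Row i of `scores` holds the first, second, ... component scores of datum i.
    scores = z @ eigvecs
    return eigvals, eigvecs, scores
```

Each eigenvector has unit length, matching the constraint placed on a1, . . . , ap, and the variance of each column of `scores` equals the corresponding eigenvalue.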

The above-described method of principal component analysis is a statistical technique well-known to a skilled person. A number of software applications for principal component analysis are known. Examples of such software useful for the present invention include GeneMaths™ (from Applied Maths) and SAS/STAT software (from SAS Institute).

(3) Discriminant Analysis

Discriminant analysis is an analysis method for statistically determining, from multivariate data, to which of a number of groups or populations an individual belongs, and analyzing the validity of such discrimination. The discrimination is basically carried out by defining the distance between an individual to be discriminated and each of the groups, and predicting that the individual belongs to the group of the shortest distance. When the number of characteristics to be referred to is one, the statistical distance is determined as:

(Individual measurement−group mean)/(standard deviation of the group) (VIII)

In general, however, Mahalanobis distance, which is extended from the above, is often used.
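A minimal sketch of discrimination by Mahalanobis distance is given below, under stated assumptions: the function names `mahalanobis_sq` and `discriminate` are hypothetical, each group's covariance matrix is estimated from its own members, and that matrix is assumed to be invertible.

```python
import numpy as np

def mahalanobis_sq(x, group):
    """Squared Mahalanobis distance from measurement vector `x` to a group,
    where `group` is an array with one row per member."""
    group = np.asarray(group, dtype=float)
    diff = np.asarray(x, dtype=float) - group.mean(axis=0)
    # atleast_2d keeps the single-characteristic case as a 1-by-1 matrix.
    cov = np.atleast_2d(np.cov(group, rowvar=False))
    return float(diff @ np.linalg.inv(cov) @ diff)

def discriminate(x, groups):
    """Predict that `x` belongs to the group at the shortest distance.
    `groups` maps a group name to its member measurements."""
    distances = {name: mahalanobis_sq(x, g) for name, g in groups.items()}
    return min(distances, key=distances.get), distances
```

With a single characteristic, the value returned by `mahalanobis_sq` reduces to the square of the distance given by formula (VIII).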

In the present invention, based on the classification obtained as a result of cluster analysis, a discriminant function for discriminating this classification based on gene expression pattern is created. Using this discriminant function, which group each of the cases to be predicted belongs to is discriminated (determined).

When the variables for multivariate analysis are viewed in terms of the presence or absence of expression of a particular gene or the level of expression, the cases (subjects) can be classified into a group in which a particular gene is expressed at high levels and another group in which the same gene is expressed at low levels. The particular gene may be suitably selected depending on the above-mentioned ratio of total variation to variation within subgroup. By examining to which group a subject specimen belongs based on the result of cluster analysis, it becomes possible, for example, to predict the likelihood that metastasis or recurrence will occur.

4. Prediction of Cancer

The state of cancer is predicted based on the result of the multivariate analysis described above. For this purpose, expression patterns characterizing different states of cancer are determined. The states of cancer include the presence or absence of cancer and the degree (stage) of progress of cancer. For example, the states of cancer include: (a) whether or not the patient suffers from cancer (presence or absence of cancer); (b) if there is cancer, what its degree of malignancy is (cancer malignancy); (c) whether or not it has metastasized; and (d) what the chances of its recurrence are. Examples of the indices for determining the malignancy include instances of early recurrence, how long the patient has to live, and tumor size.

Multivariate analysis of the above result of gene expression can provide classification results consisting of a group relating to lymph node metastasis and/or early recurrence and a group not relating to either of them. Since lymph node metastasis and recurrence are closely related to prognosis and the malignancy of cancer, they are important factors when predicting prognosis. The frequency of appearance of hormone receptors, lymph node metastasis and recurrence is statistically significantly different for each group. Accordingly, it becomes possible to predict prognosis for new cases by: examining the expression level of the genes having the sequences 1-27 from Table 1, 28-153 from Table 2, and/or 154-289 from Table 3 (preferably, genes having sequences 1-21 from Table 1, sequences 30, 33, 34, 36-42, 44-48, 54-66, 68, 69, 71, 74, 80, 82, 83, 85, 93, 100-103, 112, 115, 116, 118-121 from Table 2, and/or sequences 155, 162, 163, 167-169, 171, 172, 174, 175, 177-180, 188, 190, 193, 198, 211, 222, 242-253, 255-257, 259-261, 263, and 265 from Table 3), and, optionally, other genes which could be considered useful for the classification of cancer, using the method described in the section “1. Quantitative determination of gene expression”; or quantitatively determining the protein products encoded by those genes using the method which will be described in the section “6. Preparation and detection of an antibody,” thereby determining to which of the existing groups of cancer the expression pattern of the specimen belongs.

5. Cancer State Identification System

The identification system of the present invention comprises (a) means of analyzing the expression level of a gene isolated from a test sample; and (b) means of predicting the state of cancer by using the result of analysis as an indicator. The analysis means (a) further comprises a means (also called a detection engine) of detecting the expression level of each of a plurality of genes in a cancer cell or tissue derived from a primary cancerous lesion and in a normal tissue, and a means (also called an analysis engine) of analyzing the resultant detection values.

(1) Gene Expression Detection Engine

In the present invention, the detection data obtained as described above may be converted into digital information and used for the detection of gene expression.

(2) Analysis Engine

The analysis engine is a means of performing multivariate analysis, for example cluster analysis, based on the data (gene expression level) provided by the detection engine. This analysis process can classify the genes into a group of genes with high expression levels and a group of genes with low expression levels. Further, this means can classify the samples into estrogen-receptor expression positive, negative and positive/negative mixed groups, for example.

FIG. 3 shows the block diagram of an example of the prediction system according to the invention.

The prediction system of FIG. 3 comprises a CPU 301, a ROM 302, a RAM 303, an input unit 304, a transmitter/receiver unit 305, an output unit 306, a hard disc drive (HDD) 307, and a CD-ROM drive 308.

CPU 301 controls the cancer state prediction system as a whole according to a program stored in ROM 302, RAM 303 or HDD 307, and executes a prediction process to be described later. ROM 302 stores programs, such as those commanding the processes necessary for the operation of the prediction system. RAM 303 temporarily stores data necessary for executing the prediction process. The input unit 304 includes a keyboard and/or a mouse, for example, which is operated when inputting necessary conditions for executing the prediction process. The transmitter/receiver unit 305 executes a data transmission/reception process with a database 310, for example, via a communication line based on the commands from CPU 301. The output unit 306 displays various conditions or expressed gene detection data inputted via the input unit 304, according to commands from CPU 301. Examples of the output unit 306 include a computer display and a printer. HDD 307 stores the expression pattern information about various kinds of genes in a cell or tissue. It reads the stored program, data or the like in response to commands from CPU 301, and stores it in RAM 303, for example. Based on commands from CPU 301, the CD-ROM drive 308 reads a program, data or the like from a prediction program stored in a CD-ROM 309, and stores it in RAM 303, for example.

CPU 301 supplies the data received from the input unit, for example, to the output unit 306, while executing the prediction for the likelihood of metastasis or recurrence of cancer on the basis of the data received from the stored database. The database refers to the storage of information relating to the level of gene expression obtained as described above (including both an absolute level and a relative level).

FIGS. 4 and 5 show flowcharts of an example of prediction of the states of cancer executed by the prediction system shown in FIG. 3 in the case where gene expression patterns are analyzed.

Referring to FIG. 4, a cluster analysis device 401 will be hereafter described as an example of the multivariate analysis device. The cluster analysis device 401 generates clusters for the above prediction process. First, gene expression data is fed via an external database search/input means 402. The external database search/input means 402 has a function to access the existing, various kinds of external databases, preferably by using a predetermined keyword, in order to collect sample data to be subjected to multivariate analysis (such as cluster analysis). The above data input operation is repeated until data input is finalized. The information obtained from individual tissues or cells is stored in a sample data storage means 403 by the input of data, subjected to cluster analysis, or registered in the database.

The sample data stored in the sample data storage means 403 is inputted to a data optimization means 404, where the data is optimized for multivariate analysis. Examples of data optimization include standardization by a median, standardization by a z-score, setting of a maximum and minimum value, and logarithmic transformation, of which the one most suitable for the samples used can be selected.
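The optimization options listed above can be sketched as small helper functions. These are illustrative only: the function names are hypothetical, "standardization by a median" is assumed here to mean subtracting each variable's median, and the base-2 logarithm with a pseudocount is one common choice rather than one prescribed by the text.

```python
import numpy as np

def median_center(x):
    """Standardization by a median: subtract each column's median (assumed form)."""
    x = np.asarray(x, dtype=float)
    return x - np.median(x, axis=0)

def z_score(x):
    """Standardization by a z-score: mean 0 and standard deviation 1 per column."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

def clip_range(x, minimum, maximum):
    """Setting of a maximum and minimum value: limit values to [minimum, maximum]."""
    return np.clip(np.asarray(x, dtype=float), minimum, maximum)

def log_transform(x, pseudocount=1.0):
    """Logarithmic transformation; the pseudocount avoids log(0) for zero values."""
    return np.log2(np.asarray(x, dtype=float) + pseudocount)
```

Rows are taken to be samples and columns to be variables (genes); which transformation suits a given data set is left to the user, as stated above.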

A variable list output means 405 displays a list of the variables of the sample data to be analyzed, for example, by cluster analysis.

Next, the user selects variables from the variables displayed on the list by the variable list output means 405, using the function of a variable selection means 406.

The selection of the variables using the variable list output means 405 is carried out such that the user can freely select one or more particular variables. Typically, since there are a number of candidates for variables, the user should be able to select any of those candidates.

As the user selects particular variables, this information is inputted to an evaluation sample data file generating means 407, together with the sample data. The evaluation sample data file generating means 407 generates a data file for the evaluation samples.

The data file for the clusters for evaluation is then transmitted to an evaluation means 408, where the degree of cluster separation is evaluated. The evaluation formula for the evaluation of the degree of cluster separation can be defined in various ways.
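Since the evaluation formula is left open, the following is only one hypothetical choice: the fraction of the total variation that is explained by the cluster assignment (between-cluster sum of squares divided by total sum of squares). The function name `separation_ratio` is an assumption made for this sketch.

```python
import numpy as np

def separation_ratio(data, labels):
    """Return (total - within-cluster) / total sum of squares.
    Values near 1 indicate well-separated clusters; values near 0 indicate
    that the clustering explains little of the variation."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    total = ((data - data.mean(axis=0)) ** 2).sum()
    within = 0.0
    for c in np.unique(labels):
        members = data[labels == c]
        within += ((members - members.mean(axis=0)) ** 2).sum()
    return (total - within) / total
```

A measure of this kind could be compared against a threshold held in the evaluation condition setting means 412 to decide whether cluster classification should continue.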

The result of the evaluation of cluster separation degree by the evaluation means 408 is given to a cluster classification means 409. The cluster classification means 409 receives the evaluation result from the evaluation means 408, refers to evaluation conditions set in an evaluation condition setting means 412, determines an optimum cluster classification, and, in the case where a cluster classification continuation/termination condition is set, determines whether cluster classification should be continued or terminated. In the case where the cluster classification continuation/termination condition is not set, the cluster classification means 409 lets the user decide whether cluster classification should be continued or terminated. If the user chooses to continue with cluster classification, the cluster classification means 409 outputs the optimum cluster classification obtained in the most recent procedure, and a signal for the continuation of cluster classification. The signal for the continuation of cluster classification later constitutes a command for bringing the procedure back to the process in the variable list output means 405 after the process in a dendrogram editing means 411.

On the other hand, if the cluster classification means 409 has decided to discontinue the cluster classification operation, cluster classifications that are optimal at that point in time are identified, and a signal for the discontinuation of the cluster classifying operation is output. This signal for the discontinuation of the cluster classifying operation later constitutes a command for terminating the cluster analysis process after the process in the dendrogram editing means 411 is performed.

After the process in the cluster classification means 409 is completed, the process in a dendrogram generating means 410 is initiated. The dendrogram generating means 410 receives the cluster classification determined by the cluster classification means 409, and displays a dendrogram based on the cluster classification and the attributes of the variables relating to individual cluster classifications. The cluster classification dendrogram thus generated by the dendrogram generating means 410 allows the user to visually grasp the current state of cluster classification. Together with the generation of the dendrogram, the dendrogram generating means 410 also displays colored, patterned or otherwise decorated cells to allow the user to visually grasp the gene expression levels on whose basis the dendrogram was generated. Next, the dendrogram editing means 411 lets the user edit the cluster classification dendrogram generated by the dendrogram generating means 410 by addition, modification or deletion of the cluster classification on the screen of the display device. The addition, modification or deletion of the cluster classification is carried out by the user: designating a particular cluster and further designating the variable of a cluster which is to be classified lower than that particular cluster; merging a plurality of clusters; or deleting the branch of a certain cluster classification, for example, using a processing instruction input device on the screen. The dendrogram editing means 411 provides a variety of tools for assisting the user's editing operation on the screen. The dendrogram editing means 411 reads the significance of each revision of the cluster classification by the user and automatically modifies the data file for each cluster according to that significance. 
Preferably, the dendrogram editing means 411 asks the user whether the cluster classification by the cluster classification means 409 should be continued or terminated and lets the user input a final decision.

As a result, if the repetition of cluster classification is to be continued, the process is returned to the variable list output means 405, and the processes from the variable list output means 405 to the dendrogram editing means 411 are repeated.

Based on the thus analyzed data, the state of cancer such as the possibility of metastasis or recurrence can be determined by examining to which cluster the cancer specimen to be tested has been classified.

FIG. 5 shows a device for making a prediction based on the result of cluster analysis. A prediction device 501 constitutes a processing means by which a data file and an evaluation condition that is set via a cluster 513 outputted by the cluster analysis device of FIG. 4 can be integrated in an evaluation means 508, the data file being obtained via an external database search/input means 502, a sample data storage means 503, a data optimizing means 504, a variable list output means 505, a variable selection means 506 and an evaluation sample data file generating means 507. The individual means from the external database search/input means 502 to the evaluation sample data file generating means 507 perform the same processes as those in the cluster analysis device shown in FIG. 4. In the case where a prediction process is to be performed on the basis of the clusters which are the output of FIG. 4, a cluster 513 is inputted to an evaluation condition setting means 512. Then, an evaluation means 508, a prediction means 509, a prediction result generating means 510 and a prediction result editing means 511 perform their individual processes. When certain sample data is to be subjected to the prediction process together with the clusters which have been obtained as output of FIG. 4, the sample data is processed from the external database search/input means 502 to the evaluation sample data file generating means 507, and then integrated in the evaluation means 508 with the cluster data from the evaluation condition setting means 512.

After the process in the prediction means 509 is completed, the prediction result generating means 510 starts its process. The prediction result generating means 510 receives the prediction result produced in the prediction means 509 and displays a chart (figure) based on that prediction result and the attributes of the variables relating to the individual cluster classifications. Based on the prediction result chart generated by the prediction result generating means 510, the user can visually grasp the predicted state. In addition to the prediction result chart, the prediction result generating means 510 displays the levels of gene expression on which the chart was based, by means of letters and/or colored or patterned cells, so that the user can visually grasp the gene expression levels. Thereafter, the prediction result editing means 511 lets the user edit the prediction result chart generated by the prediction result generating means 510, by way of addition, modification and/or deletion of the cluster classifications on the screen of the display device. The prediction result editing means 511 provides a variety of tools assisting the user's editing operations on the screen. The prediction result editing means 511 reads the significance of each revision of the prediction result by the user and automatically modifies the data file of each prediction result according to that significance. Preferably, the prediction result editing means 511 asks the user to select whether the prediction operation by the prediction means 509 should be continued or terminated, so that the user can input his or her final decision.

If a repetition process for prediction is to be continued, the procedure returns to the variable list output means 505, and the above-described processes from the variable list output means 505 to the prediction result editing means 511 are repeated.

For example, when expression levels for 10 or more genes in 100 to 500 cases are measured, these data are stored as population data and cluster analysis is performed on the data for the genes to be analyzed, together with the parent (population) data, so that the genes to be analyzed can be classified into one or another group. If a particular classified group has a low probability of cancer metastasis or recurrence, it can be predicted that it is unlikely that the cancer in the individual as a subject of the cluster analysis will metastasize or recur.
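The idea of reading a prediction from the group into which a specimen falls can be sketched as below. This is a simplified stand-in, not the procedure of the invention: instead of re-running cluster analysis together with the population data, the new specimen is assigned to the nearest existing cluster centroid, and the function names, the Euclidean metric and the assumption that every cluster is non-empty are all hypothetical.

```python
import numpy as np

def cluster_event_rates(labels, events, n_clusters):
    """Observed frequency of an event (e.g. recurrence, coded 1 or 0) among
    the population cases in each cluster; every cluster is assumed non-empty."""
    labels = np.asarray(labels)
    events = np.asarray(events, dtype=float)
    return np.array([events[labels == k].mean() for k in range(n_clusters)])

def predict_for_specimen(sample, centroids, rates):
    """Assign `sample` to the nearest centroid and return that cluster's
    index together with the event frequency observed in it."""
    diffs = np.asarray(centroids, dtype=float) - np.asarray(sample, dtype=float)
    k = int(np.linalg.norm(diffs, axis=1).argmin())
    return k, float(rates[k])
```

A specimen assigned to a cluster with a low observed frequency of metastasis or recurrence would then be predicted to be unlikely to metastasize or recur.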

The present invention provides not only the program for the means for predicting the metastasis or recurrence of cancer, but also a recording medium in which that program is recorded. The recording medium may be computer-readable. Examples of the medium include a floppy disc (FD), a magneto-optical disc (MO), a CD-ROM, a hard disc, a ROM and a RAM.

6. Production of Antibody and Detection

In the present invention, in order to measure the level of gene expression, a protein product encoded by that gene can be quantitatively determined. The protein product can be immunologically quantitatively determined by using an antibody against the protein. Hereafter, the method of production of the antibody and its quantitative determination will be described.

(1) Expression and Purification of a Protein

(i) Production of an Expression Vector

A recombinant vector for expression of a protein can be obtained by linking the above-mentioned gene to an appropriate vector. A transformant can be obtained by introducing the recombinant vector into a host so that the target gene can be expressed.

As the vector, a phage or plasmid that is capable of autonomously growing in a host microorganism is used. Examples of a plasmid DNA include those derived from Escherichia coli, Bacillus subtilis and yeast. An example of a phage DNA is lambda phage. Further, animal viruses such as retrovirus and vaccinia virus, and insect virus vectors such as baculovirus can be used.

In order to insert the gene according to the invention into the vector, a method is adopted, for example, whereby purified DNA is cleaved by an appropriate restriction enzyme and inserted into a restriction enzyme site or a multi-cloning site of an appropriate vector DNA so as to be ligated to the vector.

For ligating the DNA fragment to the vector fragment, a known DNA ligase is used. The DNA fragment and the vector fragment are annealed and ligated, thereby producing a recombinant vector.

The host to be used for transformation is not particularly limited as long as it allows the target gene to be expressed therein. Examples of the host include bacteria (such as E. coli and Bacillus subtilis), yeast, animal cells (such as COS cells and CHO cells), and insect cells.

The gene can be introduced into the host by a known method (such as a method using calcium ions, electroporation, a spheroplast method, a lithium acetate method, a calcium phosphate method, lipofection, etc.).

(ii) Preparation of a Protein

In the present invention, the protein which is expressed by the above gene can be obtained from a cultured product of the above transformant possessing the target gene. The “cultured product” refers to any of (a) culture supernatant, or (b) a cultured cell or cultured microorganism, or homogenate thereof. The transformant of the invention is cultured in a culture medium by a usual method of cultivating a host. Culturing is typically performed by shaking culture or aeration culture with stirring. During culturing, antibiotics such as ampicillin or tetracycline may be added to the medium as needed.

After culturing, in the case where the intended protein is produced in the microorganism or cell, the protein is extracted by homogenizing the microorganism or cell. In the case where the intended protein is secreted from the microorganism or cell, the culture medium is used as is, or the microorganism or cell is removed by centrifugation, for example. Thereafter, the intended protein can be isolated from the culture and purified by conventional biochemical methods for the isolation and purification of proteins, such as ammonium sulfate precipitation, gel chromatography, ion-exchange chromatography, and affinity chromatography, either individually or in combination. Whether the intended protein has been obtained or not can be confirmed by SDS polyacrylamide gel electrophoresis, for example.

In the present invention, not only the entire purified protein but also its partial fragments can be used. The term “partial fragments” is used herein for fragments regardless of their length as long as they contain amino acid residues selected from the amino-acid sequences of proteins encoded by the genes 1-289 from Tables 1-3 or, in some cases, the other genes having equivalent functions to the above genes.

The partial fragments can be prepared in the form of peptide fragments by conventional peptide synthesis, for example. Peptides may be chemically synthesized in a conventional manner. Such conventional synthesis includes an azide method, an acid chloride method, an acid anhydride method, a mixed acid anhydride method, a DCC method, an activated ester method, a carboimidazole method, and an oxidation-reduction method. The synthesis may be performed by either a solid-phase or liquid-phase method. Further, in the present invention, the synthesis may be performed by a commercially available automatic peptide synthesizer (such as the automatic peptide synthesizer PSSM-8 from SHIMADZU Corporation).

(2) Preparation of an Antibody

The term “antibody” herein refers to an antibody molecule as a whole or its fragments (such as Fab or F(ab′)₂ fragments) which can bind to the above-mentioned protein or its partial fragments as the antigen. The antibody may be either a polyclonal antibody or a monoclonal antibody. In the present invention, the antibody (polyclonal or monoclonal antibody) can be generated by, for example, the following method.

(i) Production of Monoclonal Antibody

The prepared protein or a fragment thereof is administered as an antigen to a mammal, such as a rat, mouse, or rabbit. An adjuvant such as Freund's complete adjuvant (FCA) or Freund's incomplete adjuvant (FIA) may be used as needed. The immunization is performed mainly by intravenous, subcutaneous, or intraperitoneal injection. The interval of immunization is not particularly limited; immunization is performed one to ten times at intervals of several days to several weeks. Antibody-producing cells are collected one to 60 days after the last day of immunization. Examples of the antibody-producing cell include a spleen cell, a lymph node cell, and a peripheral blood cell.

To obtain a hybridoma, an antibody-producing cell and a myeloma cell are fused. As the myeloma cell to be fused with the antibody-producing cell, a generally available established cell line can be used. Preferably, the cell line used should have drug selectivity, such that it cannot survive in a HAT selective medium (containing hypoxanthine, aminopterin, and thymidine) in unfused form and can survive only when fused with an antibody-producing cell. Examples of the myeloma cell include mouse myeloma cell lines such as P3X63-Ag8.U1 (P3U1) and NS-1.

Next, the myeloma cell and the antibody-producing cell are fused. For the fusion, the cells are mixed (preferably at the antibody-producing cell to myeloma cell ratio of 5:1) in a culture medium for animal cell which does not contain serum, such as DMEM and RPMI-1640 media, and fused in the presence of a cell fusion-promoting agent (such as polyethylene glycol). The cell fusion may also be carried out by using a commercially available cell fusion device using electroporation.

The desired hybridoma is selected from the post-fusion cells. For example, a cell suspension is appropriately diluted in, for example, RPMI-1640 medium containing fetal bovine serum and then plated on a microtiter plate. A selection medium is added to each well, and the cells are cultured while the selection medium is replaced as appropriate. As a result, the cells which grow about 14 days after the start of culturing in the selection medium can be obtained as the hybridoma.

The culture supernatant of the grown hybridoma is then screened for the presence of an antibody which reacts with the intended protein. This can be carried out in a conventional manner, such as by enzyme immunoassay or radioimmunoassay. The fused cells are cloned by limiting dilution to establish a hybridoma which produces the desired monoclonal antibody.

Examples of the method of collecting the monoclonal antibody from the established hybridoma include the conventional cell culture method and ascites production method.

If it is necessary to purify the antibody in the above-described antibody collecting method, a known method such as ammonium sulfate precipitation, ion exchange chromatography, gel filtration, or affinity chromatography, or a combination thereof, may be used.

(ii) Production of Polyclonal Antibody

In order to prepare a polyclonal antibody, an immunization step is conducted in an animal in the same manner as described above. From 6 to 60 days after the last day of immunization, the antibody titer is measured by enzyme-linked immunosorbent assay (ELISA), enzyme immunoassay (EIA), or radioimmunoassay (RIA), for example. Blood is collected on the day when the maximum antibody titer is measured, to obtain an antiserum. Thereafter, the reactivity of the polyclonal antibody in the antiserum is measured by ELISA, for example.

(3) Detection

The protein can be detected by a known technique such as Western blotting, RIA or ELISA. A commercially available kit may also be used for detecting the protein.

7. Drug Design Based on the Result of the Method According to the Invention

Systems are being developed for designing compounds which specifically inactivate an active site of a target molecule related to the development of a disease, or for screening compounds which recover the function of an inactivated protein by changing its conformation. If the underlying differences in the mechanisms causing diseases with the same diagnosis or similar symptoms could be clarified at the molecular level, treatment could be tailored to individual needs (“personalized medicine”) by, for example, using different drugs for different mechanisms.

It is known that the state of cancer (such as malignancy) is determined not only by the gene causing the cancer but also by other genes. The expression of those genes varies from person to person. In the present invention, the gene expression patterns are influenced by genes that are unrelated to cancer, as well as by the cancer-causing gene. The present invention takes advantage of the result of expression of genes exhibiting such a relationship with the state of cancer to target certain of those genes and design a drug useful for cancer treatment, in order to reduce cancer malignancy and treat cancer. Specifically, the gene expression in a cancer specimen whose state of cancer (such as the presence or absence of cancer, malignancy, presence or absence of metastasis or recurrence) is predicted to be high-risk according to the method of the invention can be regulated such that the specimen has an expression pattern which is predicted to be low-risk. For example, the invention enables a drug to be designed which can suppress or enhance gene expression such that an expression pattern characteristic of a high grade of malignancy can be turned into an expression pattern characteristic of a low grade of malignancy. “High-risk” herein refers to a state where at least one of the following applies: the pathological malignancy of the cancer is high; metastasis occurs at more than one place; more than one kind of cancer is present; or there is a recurrence of cancer within 36 months of healing. “Low-risk” herein refers to a state where the pathological malignancy of the cancer is not high, a state where there is no metastasis, or a state where the cancer does not recur for more than five years. These states are only exemplary and may be changed as the treatment methods are improved.

The invention can therefore reduce the likelihood of a metastasis or recurrence of cancer and reduce the malignancy. It can also provide effective preventive treatment (including prevention for metastasis and recurrence) or therapeutic treatment against high-malignancy cancers.

First, target genes whose expression is to be regulated are selected. Based on the gene expression patterns of cancer specimens whose malignancy is predicted to be high according to the method of the invention, the genes are classified into a group of genes with high expression patterns and another group of genes with low expression patterns, and each of the thus classified genes is used as a target. More than one target gene can be selected. A plurality of genes used for cluster analysis may also be used as targets.

After determining the target genes, a drug is designed which can regulate the expression of the genes or the activity of the gene products. “Regulate the expression of the genes or the activity of gene products” herein refers to blocking, reducing, enhancing and/or facilitating the expression of the genes or the activity of the gene products.

In the case where the expression of a gene is to be suppressed, a drug is designed by which the expression of the gene can be directly suppressed. An example of a conventional method is the antisense method. Alternatively, a drug may be designed by which the function of a gene expression product (protein) can be suppressed. In this case, an antibody against the protein may be used. Further, an inhibitor of the protein activity may also be used.

The antisense method involves having an antisense sequence specifically bind to the sequence of the target gene and suppressing the expression of a target gene. Preferably, the expression of a gene that expresses at high levels is suppressed. “Expresses at high levels” means that the intracellular level of mRNA is higher than average values. An antisense sequence is a nucleic acid sequence that can specifically hybridize to at least a part of a target sequence. The antisense sequence binds to the cellular mRNA or genome DNA and blocks its translation or transcription, thereby blocking the expression of the target gene. For the antisense sequence, any nucleic acid substance may be used as long as it can block the translation or transcription of the target gene. For example, such nucleic acid substance includes DNA, RNA and any desired nucleic acid mimetics. Thus, genes expressed in a cancer specimen with high malignancy are selected from the genes having the nucleotide sequences 1-289 from Tables 1-3 and/or, in some cases, other genes having similar functions, and an antisense nucleic acid (oligonucleotide) sequence is designed such that it is complementary to a part of the sequence. Examples of the target genes whose expression is to be suppressed in the present invention include the genes having sequences 4 and 7 from Table 1; sequences 28, 29, 31, 32, 35, 43, 49-53, 67, 70, 72, 73, 75-79, 81, 84, 86-92, 94-99, 104-111, 113, 114, 117, and 122-153 from Table 2; and sequences 155, 162, 163, 167-169, 171, 172, 174, 175, 177-180, 188, 190, 193, 198, 211, 222, 242-253, 255-257, 259-261, 263 and 265 from Table 3. Preferably, one or more of these genes are used.

The length of the antisense nucleic acid sequence to be designed is not particularly limited as long as it can suppress the expression of the target gene. The length may be, for example, 10-50 bases or, preferably, 15-25 bases. Oligonucleotides can be readily synthesized chemically by known methods.
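As an illustration of the design step described above, the following Python sketch derives a candidate antisense oligonucleotide as the reverse complement of a stretch of a target sequence. The function name `antisense_oligo`, the example sequence and the default length of 20 bases are assumptions made for illustration; the sequence is not taken from Tables 1-3, and the sketch does not assess hybridization specificity or efficacy.

```python
# Watson-Crick complements for a DNA oligonucleotide.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def antisense_oligo(target: str, start: int, length: int = 20) -> str:
    """Return the reverse complement of target[start:start+length].

    `target` is the sense-strand sequence written 5'->3'; the result is
    the complementary oligonucleotide, also written 5'->3'.
    """
    if not 10 <= length <= 50:
        raise ValueError("length outside the 10-50 base range given in the text")
    region = target[start:start + length].upper()
    if len(region) < length:
        raise ValueError("target region is shorter than the requested length")
    return "".join(COMPLEMENT[base] for base in reversed(region))


# Hypothetical 22-base target fragment (illustration only).
example_target = "ATGGCCAAGCTTGACCTGAAGG"
oligo = antisense_oligo(example_target, start=0, length=15)
```

In practice a candidate obtained this way would still have to be checked against other transcripts and tested experimentally.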

The antisense sequence can be delivered to a target location (such as a cancer cell) by a variety of known administration methods employing an expression vector. Examples of the administration methods include a method using a recombinant expression vector such as a chimera virus or a colloidal dispersion system, and a method employing a variety of viral vectors including a retrovirus vector and adeno-associated virus vector.

Molecular analogs of an antisense oligonucleotide may also be used for the purposes of the present invention. A molecular analog has high stability and distribution specificity, for example. An example of the molecular analog is an antisense oligonucleotide linked to a chemically reactive group, such as iron-binding ethylenediaminetetraacetic acid.

Examples of the vector that can be used for antisense gene therapy include, but are not limited to, adenovirus, herpesvirus, vaccinia virus, and RNA viruses such as retroviruses.

Other examples of the gene delivery system that can be used for administering the antisense sequence to the target tissue or cell include a colloidal dispersion system, a liposome-induced system, and an artificial virus envelope. Specifically, a macromolecular complex, a nano-capsule, a microsphere, beads, oil-in-water type emulsion, micelle, mixed micelle, and liposome, for example, may be used as a delivery system.

According to the drug design of the invention, an antisense oligonucleotide that can bind (preferably specifically bind) to the sequence of the target gene determined on the basis of the result obtained by the cancer prediction method of the invention is administered in a therapeutically effective amount in order to block the translation of the mRNA from the gene. For example, the antisense oligonucleotide may be administered systemically such as intravenously or intraarterially, as normally done; or it may be administered locally into the cancer tissue. Optionally, any of these administration modes may be used in combination with catheter techniques and surgical techniques, for example.

The dosage of antisense oligonucleotide administered may vary depending on age, sex, symptoms, administration routes, administration frequency, and dosage form. However, a conventional method in the relevant art may be appropriately selected and used.

When an antibody is used, it can be either polyclonal or monoclonal. Further, antibody fragments may be used. An antibody can be prepared by the method described above in the section “5. Preparation of an antibody and detection.”

The dosage of antibody administered may vary depending on age, sex, symptoms, administration routes, administration frequency, and dosage form. However, it may be appropriately determined by a conventional method in the relevant art.

When the antibody is administered (parenterally), various routes of administration may be selected, such as intravenous injection (including continuous infusion), intramuscular injection, intraperitoneal injection, subcutaneous injection, and suppository. In the case of a preparation for injection, the antibody is supplied in the form of a unit-dosage ampule or a multi-dosage container.

On the other hand, if the purpose is to enhance the expression of a gene, a drug is designed by which the expression of the gene can be directly enhanced. A conventional method uses a vector (targeting vector) in which the target gene is inserted. A targeting vector refers to the nucleic acid sequence of an expressed gene connected to the promoter sequence. Preferably, the vector is used such that a low-expression gene is expressed. “Low-expression” refers to the intracellular level of mRNA being lower than average values.

One method for enhancing the gene expression is to connect a strong expression regulatory sequence (promoter) to the sequence of the target gene to thereby enhance the expression of the target gene. First, a promoter which can be active in a host cell is operably linked upstream of the target gene. By inserting this into a vector such as a viral vector, a targeting vector can be constructed which can express the target gene in the host cell at high levels. “Operably linked” herein means to link the promoter and the target gene together such that the target gene can be expressed under the control of the promoter in the host cell into which the target gene is introduced. As a result, the expression of the target gene is enhanced by the strong action of the promoter. Accordingly, a gene which is expressed at low levels in a high-malignancy cancer specimen is selected from the genes having nucleotide sequences 1-289 from Tables 1-3 and/or, in some cases, from other genes having similar functions, and a strong promoter is linked to that gene. In the present invention, examples of the target gene for expression enhancement include the genes having sequences 1-3, 5, 6, 8-19 and 21 from Table 1, sequences 30, 33, 34, 36-42, 44-48, 54-66, 68, 69, 71, 74, 80, 82, 83, 85, 93, 100-103, 112, 115, 116 and 118-121 from Table 2, and sequences 154, 156-161, 164-166, 170, 173, 176, 181-187, 189, 191, 192, 194-197, 199-210, 212-221, 223-241, 254, 258, 262, 264 and 266-289 from Table 3. Preferably, one or more of those genes are used.

Examples of the strong promoter which can be active in the host cell, for example when the host is an animal cell, include, but are not limited to, a Rous sarcoma virus (RSV) promoter, a cytomegalovirus (CMV) promoter, an early or late promoter of simian virus (SV40), a mouse mammary tumor virus (MMTV) promoter, and a CAG promoter.

The vector into which the target gene and the promoter are inserted is a vector that can be compatible to the host cell, such as one which contains genetic information that can be replicated in the host cell and thus multiply autonomously, and which can be isolated from the host cell for purification and has a detectable marker. Accordingly, for example a cis-element such as an enhancer, a splicing signal, a poly-A addition signal, a selection marker, or a ribosome binding sequence (SD sequence), as well as a target gene and a promoter, can be connected to the vector as needed. Examples of the selection marker include a dihydrofolate reductase gene, an ampicillin-resistant gene, and a neomycin-resistant gene. Examples of the vector include, but are not limited to, in the case where a mammalian cell is used as the host cell: plasmids such as pRC/RSV and pRC/CMV (from Invitrogen); vectors containing a virus-derived autonomously replicating origin, such as bovine papilloma virus plasmid pBPV (from Amersham Pharmacia) and EB virus plasmid pCEP4 (from Invitrogen); and virus vectors such as vaccinia virus, retrovirus and adenovirus.

In the case where a vector which previously possesses a promoter being active in the host cell is used, the target gene may be inserted downstream of the promoter such that the vector-possessing promoter is operably linked to the target gene. For example, the above-mentioned plasmids pRC/RSV, pRC/CMV or the like have a cloning site downstream of the promoter which is active in an animal cell. Thus, by inserting the target gene into the cloning site and thus introducing it to the animal cell, the target gene can be expressed.

In order to insert the target gene and promoter into the vector, a method is employed by which, for example, a purified DNA is inserted into the restriction enzyme site or multicloning site of an appropriate vector DNA.

The thus prepared targeting vector may be directly administered to the patient (in vivo method). Alternatively, it may be introduced into a cell obtained from the patient, preferably a stem cell, and a cell in which the target gene is to be expressed is selected and then the cell may be administered to the patient (ex vivo method). The targeting vector may be directly administered by intravenous injection (including continuous infusion), intramuscular injection, intraperitoneal injection, and subcutaneous injection, or via other route of administration. The introduction of the targeting vector into the cell may be carried out by a conventional gene-introducing method such as, for example, a calcium phosphate method, a DEAE dextran method, electroporation, or lipofection. The selection of the cell which expresses the target gene may be carried out by utilizing a selection marker, as known in the art. The administration of the cell in which the target gene is expressed may also be carried out in the same manner as in the case of the direct administering of the targeting vector.

In another example of the drug design according to the present invention, a targeting vector into which the sequence of a target gene determined on the basis of the result of the cancer prediction method of the invention and a promoter bound to the target gene are inserted is administered in a therapeutically effective amount either directly or via a cell into which the vector has been introduced, in order to enhance the expression of the gene.

The dosage of the targeting vector administered varies depending on age, sex, symptoms, administration routes, administration frequency, and dosage form, but it may be appropriately determined by a conventional method in the art.

Alternatively, an expression product of the target gene may be directly administered. In this case, a large amount of the expression product can be obtained by using a conventional recombinant protein production method. For example, the expression products of the target gene can be produced by using Escherichia coli. The expression products of the target gene may be administered in the same manner as the targeting vector. The dosage of the expression products administered varies depending on age, sex, symptoms, administration routes, administration frequency, and dosage form. However, it may be appropriately determined by a conventional method in the art.

Various types of preparations may be formulated in a conventional manner by appropriately selecting pharmaceutically acceptable substances that are typically used for the formulation of preparations, such as excipient, disintegrant, lubricant, surfactant, dispersing agent, buffering agent, preservative, solubilizer, antiseptic agent, stabilizing agent, and isotonizing agent.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an outline of the cancer prediction method according to the present invention.

FIG. 2 shows a scheme of adaptor-tagged competitive PCR.

FIG. 3 shows a block diagram of a metastasis or recurrence identification system.

FIG. 4 shows a flowchart of the processes performed by a metastasis or recurrence identification program.

FIG. 5 shows another flowchart of the processes performed by a metastasis or recurrence identification program.

FIG. 6 shows the results of cluster analysis of genes from 179 cases of breast cancer.

FIG. 7 shows the results of cluster analysis of genes from 301 cases of breast cancer.

FIG. 8 shows the results of cluster analysis of genes from 115 cases of colon cancer.

FIG. 9 shows the results of cluster analysis of genes belonging to an M cluster.

FIG. 10 shows the results of cluster analysis of genes belonging to a P cluster.

FIG. 11 shows the results of principal component analysis of an M cluster.

FIG. 12 shows the results of principal component analysis of a P cluster.

FIG. 13 shows the results of principal component analysis of the M and P clusters.

DESCRIPTION OF REFERENCE NUMERALS

301: CPU, 302: ROM, 303: RAM, 304: input unit, 305: transmitter/receiver unit, 306: output unit, 307: HDD, 308: CD-ROM drive, 309: CD-ROM, 310: database,

401: cluster analysis device, 402: external database search/input means, 403: sample data storage means, 404: data optimization means, 405: variable list output means, 406: variable selection means, 407: evaluation sample data file generating means, 408: evaluation means, 409: cluster classification means, 410: dendrogram generating means, 411: dendrogram editing means, 412: evaluation condition setting means,

501: prediction device, 502: external database search/input means, 503: sample data storage means, 504: data optimization means, 505: variable list output means, 506: variable selection means, 507: evaluation sample data-file generating means, 508: evaluation means, 509: prediction means, 510: prediction result generating means, 511: prediction result editing means, 512: evaluation condition setting means, 513: cluster

BEST MODES OF CARRYING OUT THE INVENTION

Hereafter the present invention will be further described in detail by way of examples. It should be noted that the technical scope of the invention is not limited by these examples.

(EXAMPLE 1)
Adaptor-Tagged Competitive PCR Utilizing a Breast Cancer Specimen

The expression levels of 2412 genes were measured in 110 cases (98 cases of breast cancer, one case of male breast cancer, one case of thyroid cancer, and 10 cases of normal tissue) by an adaptor-tagged competitive PCR method.

Specifically, the tissues were homogenized and total RNA was obtained from the above cancer or tissue by a guanidine isothiocyanate method. Then, a chemically synthesized biotinylated oligo (dT)18 primer was added to 7 μL of distilled water containing the total RNA (3 μg). The mixture was heated at 70° C. for 2 to 3 minutes, and was further maintained at 37° C. for one hour to synthesize cDNA. To the resultant single-stranded cDNA was added a reaction solution containing a DNA synthase, and they were reacted at 16° C. for one hour and then at room temperature for one hour, to synthesize double-stranded cDNA.

When the reaction had been completed, 3 μL of 0.25M EDTA (pH 7.5) and 2 μL of 5M NaCl were added, and phenol extraction and ethanol precipitation were carried out. The resultant cDNA was dissolved in 120 μL of distilled water. When the cleaving reaction with the restriction enzymes had been completed, the reaction solution was heated at 75° C. for 10 minutes, diluted with 9 volumes of distilled water and then used for a reaction for adding an adaptor, as described below.

A PCR reaction was conducted by using a gene specific primer and an adaptor primer. Each solution of the above composition was subjected to 30-35 cycles of reaction, each cycle consisting of heating at 94° C. for 30 seconds, at 55° C. for one minute, and at 72° C. for one minute. Thereafter, the solution was reacted at 72° C. for 20 minutes. When the reaction had been completed, the solution was maintained at 37° C. for one hour.

The final product was thermally denatured and then 0.5 μL of it was analyzed by an ABI 3700 DNA Analyzer to determine the expression level of each gene.

(EXAMPLE 2)
Cluster Analysis for Breast Cancer

As a group of genes useful for classification, genes satisfying the following equation were selected:

(variance in cancer specimens)/(variance in normal specimens)≧1.20

As a result, 152 genes satisfying the above equation were selected. From those 152 genes, 21 were further isolated (selected), based on differences in expression levels between estrogen receptor-positive and negative groups (p<3.85×10⁻⁵). Table 1 shows a list of the isolated genes, in which sequences 1-21 are those of the isolated genes.
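The variance-ratio criterion above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `select_by_variance_ratio`, the dictionary layout and the use of the sample variance (n − 1 denominator) are choices made here, since the source does not state how the variance was computed. The gene names and values are hypothetical.

```python
from statistics import variance


def select_by_variance_ratio(cancer, normal, threshold=1.20):
    """Return genes whose cancer/normal variance ratio is >= threshold.

    cancer, normal: dicts mapping a gene name to a list of expression
    levels measured in cancer and normal specimens, respectively.
    """
    selected = []
    for gene, cancer_levels in cancer.items():
        normal_var = variance(normal[gene])
        if normal_var == 0:
            continue  # ratio undefined; skipped in this sketch
        if variance(cancer_levels) / normal_var >= threshold:
            selected.append(gene)
    return selected


# Hypothetical expression levels for two genes.
cancer_levels = {"g1": [0.0, 2.0, 4.0], "g2": [1.0, 1.1, 0.9]}
normal_levels = {"g1": [1.0, 2.0, 3.0], "g2": [1.0, 2.0, 3.0]}
chosen = select_by_variance_ratio(cancer_levels, normal_levels)
```

Lowering `threshold` to 1.15 or 1.10 admits more genes, which corresponds to the relaxed criteria discussed later in this example.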

Then, cluster analysis was conducted based on the expression patterns of those isolated genes. FIG. 6 schematically shows the results, in which 179 cases are arranged vertically and 21 gene names are arranged horizontally. The gene names are, for group A from left to right, GS7435, GS2307 and GS2828. For group B, from left to right, GS2632, GS7288, GS6601, GS7583, GS7116, GS7715, GS6770, GS2471, GS6711, GS1176, GS7001, GS690, GS1472, GS6784, GS7012, GS7632, GS1957 and GS7264. Each cell (square) indicates the state of expression of the gene. A white square indicates a high expression, a black square indicates a low expression, and a gray square indicates an intermediate expression level. The lighter the shade of gray, the higher the expression, and the darker the shade of gray, the lower the expression. In the present example, a low expression means the expression level of not less than −1.3 and not more than −0.3; an intermediate expression means more than −0.3 and less than 0.3; and a high expression means not less than 0.3 and not more than 1.3. “Expression level” here refers to the level calculated by standardizing measurements with a median value, limiting the standardized values within an upper value of 20 and a lower value of 0.5, and then transforming the values into a logarithm.
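The definition of “expression level” given above can be sketched as follows. Several points are assumptions made for illustration: the median is taken over the measurements of one gene across specimens, the logarithm is base 10 (which is consistent with log10(20) ≈ 1.301 as the upper bound of a high expression), and the names `expression_levels` and `expression_class` are hypothetical. The limits 0.5 and 20 are the values stated in the text; with a lower limit of 0.5 the smallest attainable level is log10(0.5) ≈ −0.301, whereas a lower limit of 0.05 would give −1.301, so the limits are left as parameters.

```python
import math
from statistics import median


def expression_levels(measurements, lower=0.5, upper=20.0):
    """Median-standardize, clip to [lower, upper], then take log10."""
    med = median(measurements)
    return [math.log10(min(max(m / med, lower), upper)) for m in measurements]


def expression_class(level):
    """Map a log-transformed level to the categories used for FIG. 6."""
    if level <= -0.3:
        return "low"
    if level < 0.3:
        return "intermediate"
    return "high"


# Hypothetical raw measurements of one gene in five specimens.
raw = [5.0, 40.0, 50.0, 60.0, 1500.0]
levels = expression_levels(raw)
classes = [expression_class(v) for v in levels]
```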

In FIG. 6, the numerical values in the column “L1” indicate specimen numbers given for the sake of convenience. The open circles and closed circles in the column of “L2” indicate whether or not there is expression of the estrogen receptor: an open circle indicates positive and a closed circle negative. The open and closed circles in the column of “L3” indicate whether or not there is lymph node metastasis and if there is, how many: an open circle indicates zero; a closed circle indicates one to three; and double closed circles indicate four or more. As shown in FIG. 6, the cases can be divided into four groups (I, II, III and IV), while the gene groups can be divided into two groups (A and B).

Table 5 shows the relationship between the case groups and the gene groups (Groups A and B).

TABLE 5
Case group	Group A	Group B	Estrogen receptor
I	Low expression	High expression	Mostly positive
II	Low expression	Low expression	Mixture of positive and negative
III	High expression	High expression	Mixture of positive and negative
IV	High expression	Low expression	Mostly negative

Table 6 shows the relationship between the case groups and lymph node metastasis.

TABLE 6
Case group	Metastasis present (one or more)	Metastasis absent (zero)	Metastasis (%)
I	22	61	26.5
II	8	10	44.4
III	18	9	66.6
IV	16	24	40
Total	64	104	38.1

Group I has less metastasis and Group III has more metastasis.
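The percentages in Table 6 follow directly from the counts, as the short Python sketch below shows. The variable and function names are chosen here for illustration only. Note that Group III computes to 66.7% when rounded, so the tabulated 66.6 appears to be truncated.

```python
# Counts from Table 6: (metastasis present, metastasis absent) per case group.
table6 = {"I": (22, 61), "II": (8, 10), "III": (18, 9), "IV": (16, 24)}


def metastasis_rate(present, absent):
    """Percentage of cases with at least one lymph node metastasis."""
    return 100.0 * present / (present + absent)


rates = {group: metastasis_rate(*counts) for group, counts in table6.items()}
total_present = sum(p for p, _ in table6.values())
total_absent = sum(a for _, a in table6.values())
total_rate = metastasis_rate(total_present, total_absent)
```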

Similarly, when genes satisfying the following equation:

(variance in cancer specimens)/(variance in normal specimens)≧1.15

are selected based on differences in the level of expression between an estrogen receptor-positive and negative group, the genes having nucleotide sequences 1-27 from Table 1 are selected.

Further, if the selection is set such that

(variance in cancer specimens)/(variance in normal specimens)≧1.10,

genes other than those having nucleotide sequences 1-27 from Table 1 are additionally selected. Thus, by subjecting the levels of expression of these selected genes to multivariate analysis and dividing the cases into several groups in a similar manner, information useful for predicting prognosis can be obtained.
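Dividing the cases into groups by cluster analysis can be illustrated with a small agglomerative clustering sketch in Python. The source does not state which clustering algorithm or distance measure was used; average linkage with Euclidean distance is assumed here purely for illustration, and the function names and expression profiles are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    profiles: one list of expression levels per case.
    Returns a list of clusters, each a list of case indices.
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                total = sum(euclidean(profiles[a], profiles[b])
                            for a in clusters[i] for b in clusters[j])
                dist = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical profiles: four cases, two genes (group A-like, group B-like).
cases = [[1.0, -1.0], [0.9, -1.1], [-1.0, 1.0], [-1.1, 0.9]]
groups = average_linkage(cases, n_clusters=2)
```

Recording the order of the merges instead of stopping at a fixed number of clusters would give the information needed to draw a dendrogram.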

(EXAMPLE 3)
Estimation of Metastasis and Early Recurrence of Breast Cancer

In the present example, the prediction of metastasis and early recurrence was analysed in 301 cases of breast cancer. Cluster analysis was conducted by using the 21 genes selected in Example 2. The results were as shown below.

1. Estrogen Receptor-Positive Group (Molecular Groups 1a and 1b in FIG. 7)

In this group, lymph node metastasis was observed in 45 out of 143 cases (31%). Early recurrence was observed in 5 out of 60 cases (8%).

2. Estrogen Receptor-Positive and Negative Mixed Groups (Molecular Groups 2a and 2b in FIG. 7)

Lymph node metastasis was observed in 47 of 101 cases (47%), and early recurrence was observed in 14 out of 49 cases (29%).

3. Estrogen Receptor-Negative Group (Molecular Group 3 in FIG. 7)

Lymph node metastasis was observed in 21 of 44 cases (48%), and early recurrence was observed in 4 of 10 cases (40%).

Those results are shown in Table 7.

TABLE 7
	Lymph node metastasis	Early recurrence
Estrogen receptor-positive group	45/143 (31%)	5/60 (8%)
Estrogen receptor-positive and negative mixed groups	47/101 (47%)	14/49 (29%)
Estrogen receptor-negative group	21/44 (48%)	4/10 (40%)

In FIG. 7, ER designates an estrogen receptor (positive +, negative −), LN designates lymph node metastasis (number), and REC designates recurrence (positive or negative).

FIG. 7 and Table 7 indicate that the chances of early recurrence in the estrogen receptor-negative group can be high. Early recurrence invariably results in death, so the results obtained by the method of the invention can provide important information for medical prognosis.

(EXAMPLE 4)
Prediction of Breast Cancer

By combining the molecular groups for the prediction of cancer obtained in Example 3 and known clinical parameters, prognosis of breast cancer can be carried out as accurately as possible. Table 8 shows the clinical parameters and their prognostic significance determined by Cox regression analysis.

TABLE 8
	Risk/reference factors (R.R.)	Univariate R.R.	Univariate p value	Multivariate R.R.	Multivariate p value
Menstrual status	Before/after	1.4	0.497		
Tumor size	>2/<2 (cm)	1.8	0.187		
Lymph node metastasis	Positive/negative	2.9	0.012	2.3	0.048
Histological grade	III/I and II	3.1	0.016		0.218
Estrogen receptor	Negative/positive	2.6	0.030		0.159
Molecular group	2 and 3/1	4.0	0.006	3.2	0.022

Based on the information in Table 8, prognosis of a cancer specimen can be determined from a plurality of parameters. Particularly, the R.R. value (degree of risk relative to early recurrence) is highest in the molecular group. Thus, cancer can be predicted more accurately by the molecular group than by the conventional clinical parameters.

(EXAMPLE 5)
Adaptor-Tagged Competitive PCR using a Colon Cancer Specimen

The expression levels of 1536 genes were measured in 115 cases (105 cases of colon cancer and 10 cases of normal tissue) by the adaptor-tagged competitive PCR method.

PCR reaction and the quantitative determination of the gene expression levels were carried out in the same way as in Example 1.

(EXAMPLE 6)
Selection of Genes by Cluster Analysis

Cluster analysis was performed by using the expression patterns of the above 1536 genes. FIG. 8 schematically shows the result. FIG. 8 shows the 115 cases listed vertically and the results of expression of the 1536 genes listed horizontally. As in FIG. 6, each cell (square) indicates the state of expression levels of a gene. A white cell (blank square) indicates a high expression level, a black cell (blacked-out square) indicates a low expression level, and gray indicates an intermediate level of expression. A lighter shade of gray indicates a higher expression level, and a darker shade of gray indicates a lower expression level. A low expression level refers to expression levels of not less than −1.301 and not more than −0.3. An intermediate expression level refers to expression levels more than −0.3 and less than 0.3. A high expression level refers to expression levels not less than 0.3 and not more than 1.301. As a result of the cluster analysis, the 1536 genes were divided into 88 clusters.

From the thus cluster-analyzed genes, cluster No. 14 in FIG. 8 was selected as a metastasis (M) cluster, while clusters Nos. 42-44 were selected as prognosis (P) clusters. Clusters No. 14 and Nos. 42-44 were selected because, when cluster analysis as described in Example 7 below was performed in advance, those clusters had been predicted to be related to metastasis and prognosis, respectively.

Table 2 above shows the genes contained in cluster No. 14. In Table 2, the genes of sequences Nos. 28 to 153 are those selected as the M cluster. On the other hand, Table 3 above shows the genes contained in clusters Nos. 42 to 44. In Table 3, the genes of sequences Nos. 154 to 289 are those selected as the P cluster.

(EXAMPLE 7)
Multivariate Analysis (Cluster Analysis)

Cluster analysis was performed on the genes selected in Example 6. FIG. 9 shows the cluster analysis of the genes belonging to the M cluster. FIG. 10 shows the cluster analysis of the genes belonging to the P cluster. In FIG. 9, 115 cases are arranged vertically and the 126 genes of the M cluster are arranged horizontally. Each cell (square) indicates the expression level of a gene. Me indicates metastasis, and Pr indicates prognosis. The colors in column Me are such that black represents a metastasized cancer specimen, white a cancer specimen without metastasis, and gray a normal specimen. The colors in column Pr are such that black indicates a cancer specimen with poor prognosis, white a cancer specimen with intermediate prognosis, light gray a cancer specimen with good prognosis, and dark gray a normal specimen. “Poor” prognosis means that the patient died of cancer within two years following the treatment of the primary lesion of colon cancer. “Intermediate” prognosis likewise means that the patient either died of cancer within two to five years or, if alive, has been observed for less than four years. “Good” prognosis means that the patient is alive and has been observed for four or more years.

FIG. 10 shows 115 cases arranged vertically and 136 genes of P cluster arranged horizontally. Numerals 42, 43 and 44 indicate the cluster numbers in the cluster analysis shown in FIG. 8. Each cell (square) indicates the expression level of a gene. The colors in a column Me on the right are coded such that black, white and gray indicate a metastasized cancer specimen, a cancer specimen without metastasis, and a normal specimen, respectively. The colors in a column Pr on the right are coded such that black, white, light gray and dark gray indicate a cancer specimen with poor prognosis, a cancer specimen with intermediate prognosis, a cancer specimen with good prognosis, and a normal specimen, respectively.

FIGS. 9 and 10 show that in M cluster, cases of metastasized specimens are concentrated at the bottom, while in P cluster, specimens with bad prognosis and metastasized specimens are concentrated at the top. It was believed, therefore, that the specimens were successfully classified into individual groups by cluster analysis using these genes belonging to M or P clusters according to relevance to metastasis and clinical prognostic factors. Thus, the inventors selected M cluster as a group related to metastasis and P cluster as a group related to prognosis and metastasis.

(EXAMPLE 8)
Multivariate Analysis (Principal Component Analysis)

Principal component analysis was carried out in order to determine statistically significant values of the results of cluster analysis of M and P clusters performed in Example 7. The results are shown in FIGS. 11 and 12. In FIG. 11, a metastasized cancer specimen is indicated by “•”, a cancer specimen without metastasis by “+”, and a normal specimen by “×”. In FIG. 12, a specimen with poor prognosis is indicated by “•”, a specimen with intermediate prognosis by a square, a specimen with good prognosis by “+” and a normal specimen by “×”.

As a result of principal component analysis, a border line can be drawn, as indicated by the dashed line in FIGS. 11 and 12. Based on these figures, the values shown in Table 9 were determined. The border line indicates an average value of a first principal component.
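The following Python sketch illustrates how a first principal component score could be computed for each specimen and compared with the average score that serves as the border line. It is a minimal sketch under stated assumptions: the source does not describe its computational procedure, so power iteration on the sample covariance matrix is used here for simplicity, and the names `first_principal_component` and `side_of_border` as well as the data are hypothetical. The sign of a principal component is arbitrary, so which side is called “positive” depends on the convention adopted.

```python
import math
import random


def first_principal_component(rows, iterations=500):
    """Return (loading vector, scores) of the first principal component.

    rows: one list of expression levels per specimen (same genes in each).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the genes (p x p).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration from a fixed pseudo-random start vector.
    rng = random.Random(0)
    vec = [rng.random() + 0.1 for _ in range(p)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    scores = [sum(c[j] * vec[j] for j in range(p)) for c in centered]
    return vec, scores


def side_of_border(scores):
    """Label each specimen relative to the average first-component score."""
    border = sum(scores) / len(scores)
    return ["positive" if s > border else "negative" for s in scores]


# Hypothetical expression levels: four specimens, two genes.
data = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0], [8.0, 4.0]]
loading, scores = first_principal_component(data)
labels = side_of_border(scores)
```

Counting specimens with and without metastasis on each side of the border gives a 2×2 table of the kind summarized in Table 9.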

TABLE 9 Factor scores of first principal component
		Negative	Positive	χ² test
P cluster
Prognosis	Poor (%)	39.3	11.1	8.45 ο
	(poor/good)	(13/20)	(5/45)
Metastasis	Positive (%)	47.7	18.3	8.96 ο
	(positive/negative)	(21/23)	(11/49)
M cluster
Prognosis	Poor (%)	33.3	19.6	0.017 X
	(poor/good)	(8/23)	(10/42)
Metastasis	Positive (%)	46.7	18.6	10.2 ο
	(positive/negative)	(22/24)	(10/49)

The values in Table 9 indicate the evaluation of each cluster, wherein when the value of the first principal component is positive, the prediction for prognosis or metastasis is positive, and when it is negative, the prediction is negative. The evaluation is carried out by a χ² test (χ² = 6.63 when p = 0.01). A value exceeding this χ² value indicates that the individual ratios are significantly different and useful for cancer prediction. Accordingly, the genes in the P cluster are useful for predicting both prognosis and metastasis, while the genes in the M cluster are useful for predicting metastasis.
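The χ² evaluation can be reproduced from the counts in Table 9 with the short Python sketch below. The use of Yates' continuity correction is an assumption: the source does not state which variant of the test was used, but the corrected statistic reproduces the tabulated values 8.96 (P cluster, metastasis) and 10.2 (M cluster, metastasis) from the listed counts. The function name `chi_square_yates` is chosen here for illustration.

```python
def chi_square_yates(a, b, c, d):
    """Chi-square statistic with Yates' continuity correction for the
    2x2 table [[a, b], [c, d]] (one degree of freedom)."""
    n = a + b + c + d
    numerator = n * max(abs(a * d - b * c) - n / 2.0, 0.0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


CRITICAL_P01 = 6.63  # chi-square value for p = 0.01, as stated in the text

# P cluster, metastasis row of Table 9:
# negative factor scores: 21 positive / 23 negative,
# positive factor scores: 11 positive / 49 negative.
stat = chi_square_yates(21, 23, 11, 49)
significant = stat > CRITICAL_P01
```

Here `stat` is approximately 8.96, which exceeds 6.63, in line with the “ο” mark in the table.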

The present inventors further conducted principal component analysis by combining the M and P clusters. The results are shown in FIG. 13, in which the first principal component on the horizontal axis is that of the P cluster, and the first principal component on the vertical axis is that of the M cluster. A metastasized cancer specimen is indicated by “•” and a cancer specimen without metastasis by “+”. As a result of the principal component analysis, a border line can be drawn in the figure, as indicated by the dashed line. This border line indicates an average value of the first principal components. Based on FIG. 13, the values shown in Table 10 were determined.

TABLE 10
Statistical significance of a combined analysis
P and M clusters	Measure	2nd, 3rd and 4th quadrants	1st quadrant	χ² test
	Prognosis: Poor (%)	31.8	10.2	4.46 X
	(poor/good)	(14/30)	(4/35)
	Metastasis: Positive (%)	45.0	11.3	11.9 ο
	(positive/negative)	(27/33)	(5/39)
M cluster (factor score of 1st component)	Measure	Negative	Positive	χ² test
	Metastasis: Positive (%)	46.7	18.6	10.2 ο
	(positive/negative)	(22/24)	(10/49)

In Table 10, the quadrants refer to the parts divided by the border lines as shown in FIG. 13. The first, second, third and fourth quadrants respectively correspond to the upper-right, lower-right, upper-left and lower-left parts of FIG. 13.
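The quadrant assignment described above can be sketched as follows. The quadrant numbering follows the text (upper-right, lower-right, upper-left, lower-left). The function name `quadrant`, the default border values of zero and the handling of scores lying exactly on a border line are assumptions made for illustration.

```python
def quadrant(p_score, m_score, p_border=0.0, m_border=0.0):
    """Return the quadrant (1-4) of a specimen in a FIG. 13 style plot.

    p_score: first principal component of the P cluster (horizontal axis).
    m_score: first principal component of the M cluster (vertical axis).
    Scores exactly on a border count as left/lower (an assumption).
    """
    right = p_score > p_border
    upper = m_score > m_border
    if right and upper:
        return 1  # upper-right
    if right:
        return 2  # lower-right
    if upper:
        return 3  # upper-left
    return 4      # lower-left


# Hypothetical (P score, M score) pairs, one per specimen.
examples = [(1.2, 0.8), (0.5, -0.3), (-0.7, 0.4), (-1.1, -0.9)]
print([quadrant(p, m) for p, m in examples])
```

In Table 10, specimens in quadrant 1 are compared against the pooled specimens of quadrants 2, 3 and 4.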

Table 10 indicates that specimens classified in the first quadrant by multivariate analysis of the expression patterns of the genes belonging to the P and M clusters have a low probability (11.3%) of metastasis, while specimens classified in the other quadrants have a higher probability of metastasis (45.0%). With regard to metastasis, the χ² value is higher when the M and P clusters are combined (11.9) than when the M cluster is used alone (10.2). Thus, it is suggested that this combination allows metastasis of colon cancer to be predicted more efficiently. Because the prognosis of colon cancer cannot be predicted with statistical significance by the combination of the M and P clusters, as shown in Table 10, it is believed to be preferable to use the genes of the P cluster for that purpose, as shown in Table 9.

All publications, patents and patent applications cited herein are incorporated herein by reference in their entirety.

INDUSTRIAL APPLICABILITY

The present invention provides a cancer predicting method and a drug design method. The method according to the invention is useful for genetic diagnosis for evaluating cancer malignancy. The results according to the method of the present invention are useful for designing drugs.

Number	Date	Country	Kind
2001-73063	Mar 2001	JP	national
2001-108503	Apr 2001	JP	national
2001-234807	Aug 2001	JP	national
