The present application claims priority from Japanese application JP 2003-132846 filed on May 12, 2003, the content of which is hereby incorporated by reference into this application.
1. Technical Field
The present invention relates to a method of predicting the function of a gene or protein, and more particularly to a method of predicting the function of a search object sequence using a text mining technique.
2. Background Art
Conventionally, research into genomic drug discovery is conducted through processes such as the following: identification of individual genes by genomic study, clarification of the function of individual genes, search for and identification of drug discovery target proteins, discovery of lead compounds and optimization of structure, study of safety and pharmacodynamics, pharmacogenomic research, and clinical trial. In this process, researchers are inundated by a flood of information from the initial stage of genomic study. According to the announcement of the Human Genome Project team, there are 30,000 to 40,000 human genes. Therefore, in order to investigate the validity of the human genes as drug discovery targets, a tremendous amount of costly and time-consuming experimentation must be performed.
In order to narrow down the genes or proteins that can be targets, function prediction methods employing a query sequence (a newly determined sequence with unknown functions) have been proposed, of which the major examples are similarity searches and motif searches. In the homology search, which is a type of similarity search, a query sequence is compared with each of the known sequences in a database. If there is a similar sequence in the database, it is predicted that the function of the query sequence is also similar to the function of that sequence (see Non-patent Documents 1 and 2). In the motif search, a sequence motif (a localized conserved sequence pattern) characterizing a specific function group is extracted from known sequences and a library is prepared, based on which a search is conducted (see Non-patent Document 3). In both methods, either public databases are searched for information concerning a sequence or a sequence group that is homologous to the sequence with unknown functions, or data in a database constructed from original data is assigned as the predicted function of the sequence with unknown functions.
[Non-patent Document 1] “Basic local alignment search tool”, Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215:403-410.
[Non-patent Document 2] “Identification of protein coding regions by database similarity search”, Gish, W. & States, D. J. (1993) Nature Genet. 3:266-272.
[Non-patent Document 3] “Pfam: multiple sequence alignments and HMM-profiles of protein domains”, Sonnhammer, E. L. L., Eddy, S. R., Birney, E., Bateman, A. & Durbin, R. (1998) Nucleic Acids Res. 26:320-322.
For sequences with known functions, various experiments have been conducted by researchers in various countries. The vast amounts of information obtained by these experiments are only partly stored in databases; much information is not available in database form and is believed to be hidden in papers written by researchers. Because the aforementioned similarity search and motif search are based only on the information stored in databases, they suffer from a shortage of information. The most important things in drug discovery are: searching genome information (genomic sequences, full-length cDNA sequence information, and expression profile information) or SNPs for drug-discovery target genes; directly reflecting the research results of structural genomics in efficient drug design; and incorporating SNP information into clinical development early, so as to reduce development time and cost. There has also been the problem that, because there is no means of investigating the available experimental information in an exhaustive manner, the drug-discovery targets cannot be narrowed down, with the result that experiments are repeated in fields in which experiments have already been conducted.
In view of these problems of the prior art, it is the object of the invention to provide a method of predicting the function of genes or proteins by extracting new information in an efficient and exhaustive manner.
In accordance with the invention, the aforementioned object is achieved by a method of predicting the function of sequences with unknown functions in which reference is made to knowledge stored in as many as 10 million references, in addition to the knowledge stored in databases, which is the only knowledge consulted by the conventional methods. The information obtained from the references is displayed to the user by means of several visualization tools in an easily understandable manner, thereby facilitating the discovery of information that is not obtainable from the databases alone, and the prediction of the function of sequences.
The invention provides an information search method comprising:
Preferably, the step of searching for the documents comprises performing an associative search using documents contained in the retrieved entries as key documents. The associative search may be performed using a plurality of document databases.
The extracted feature terms are preferably classified by concept, such as disease, before being outputted. It is also effective to employ a method whereby the extracted feature terms are sorted by frequency of appearance and then displayed together with information about the frequency of appearance, or a method whereby the extracted feature terms are sorted by E-value and then displayed together with information about the E-value.
Embodiments of the present invention will be described by referring to the drawings. The present invention is based on the premise that an environment exists in which access can be made, via communications networks such as the Internet, to search engines or databases, such as public databases, in which sequence information and information about the function of proteins are stored. The invention may utilize the existing databases and search engines, and therefore their detailed descriptions are omitted.
First, a query concerning the sequence data or structure information as an object of analysis is entered (S21). What is entered as a query is sequence data about the protein that the researcher has analyzed, for example. Then, a homology search is conducted to search for sequences similar to the query (S22). Specifically, a homology search is conducted on protein amino acid sequence databases, such as SWISS-PROT, recognizing even low levels of homology. In this search, the base sequences are translated into amino-acid sequences while searching for homologous intervals.
Then, the sequences found in step S22 that are homologous to the query are sorted in the order of E-value, for example, which will be described later. The results of the homology search, such as the protein names, E-values, the number of relevant references, and the names of the entries in the protein amino acid database, such as the entry names of SWISS-PROT, are displayed (S23). Then, relevant references of the sequences with high homology to the query are extracted (S24). In this process, the MEDLINE IDs of the references in the SWISS-PROT entries found in step S22, or the number of documents, are determined. The relevant references of the sequences with high homology to the query are then retrieved again, using the associative search engine GETA (S25). Then, keywords contained in the relevant references that have been re-retrieved and expanded by the associative search are displayed (S26). The display may show the number of references that contain the keywords in a matrix (S27), or it may show the number of co-occurrences among keywords, counted in the documents, in a table (S28).
The search-object sequence or structure information as the query is entered in an input box 31. In response to the entered query, a homology search is conducted on a protein amino acid sequence database, such as SWISS-PROT, recognizing even low levels of homology. This search, in which the base sequences are translated into amino-acid sequences while searching for homologous intervals, can be conducted by using known techniques, such as NCBI's BLAST (Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, Nucleic Acids Res. 25:3389-3402.). By conducting the homology search using BLAST, information concerning the sequences with high homology, such as the type of the database, accession number, the entry names of the database, scores, and E-values, can be obtained. The score refers to “a point obtained by summing up the positive values that are given when there are identical residues at the same position of two sequences arranged side by side, and the negative values that are given when the residues at the same position are different.” The higher the score, the higher the homology. The E-value refers to “an expected value of the number of sequences that have the same score purely by chance in the current database.” The smaller the E-value, the less likely it is that the match arose by chance. Thus, if the score is large and the E-value is small, it can be said that the homology between individual sequences is high. When button 32 is pressed, a homology search by BLAST is performed, and the results are displayed as shown at the bottom of the drawing.
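The score definition quoted above can be illustrated with a minimal sketch. It assumes two sequences that are already aligned to the same length and a flat value of +1 for identical residues and -1 for differing residues; BLAST itself uses substitution matrices and gap penalties, so the function name and the values here are hypothetical and serve only to show the summation.

```python
def simple_alignment_score(seq_a: str, seq_b: str,
                           match: int = 1, mismatch: int = -1) -> int:
    """Sum per-position values for two sequences arranged side by side.

    Assumption: the sequences are pre-aligned and of equal length, and the
    match/mismatch values are flat, unlike the matrices used by BLAST.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Positive value for identical residues, negative value otherwise.
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


# Made-up example sequences, not taken from any database.
identical = simple_alignment_score("MKTAY", "MKTAY")
one_diff = simple_alignment_score("MKTAY", "MKTAV")
```

With these illustrative values, the identical pair scores higher than the pair with one differing residue, which matches the statement that a higher score indicates higher homology.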
When a homology search is performed to retrieve sequences that are homologous to the query, such as a search object sequence or sequence information, several sequences that can be considered highly homologous are obtained. The entries of the known sequences assumed to be highly homologous are then displayed on the result display screen in the order of decreasing homology. In the illustrated example, the number of results to be displayed can be designated in an input box 34, and as many entries as the designated number are listed. The default number of sequence outputs is 50. The output items in the table include item 36 for the entry names of a protein amino acid sequence database, such as SWISS-PROT, item 37 for the value indicating the degree of homology, such as the E-value, and item 38 for the number of references. The number of references indicates the number of references in the entries that are relevant to the homologous sequences found by the search in step S22, in which an amino acid sequence database, such as SWISS-PROT, is referred to. The results of the homology search, namely the homologous sequences, are sorted by a value indicating the degree of homology, such as the E-value, and then displayed. When the E-value is used as the value indicative of the degree of homology, the sequences are sorted in increasing order of E-value, that is, in decreasing order of homology. Links are provided from the entry names 36, namely the SWISS-PROT entry names in the example, to the relevant protein amino acid sequence database pages. Links are also provided from the number of references 38 to MEDLINE. When button 33 is pressed, a KEYWORD LIST is displayed.
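The ordering of the result table can be sketched as follows. This is only an illustration: the `Hit` record, its field names, and the entry names are hypothetical, and the only facts taken from the description are that a smaller E-value indicates higher homology and that the default number of displayed entries is 50.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    entry_name: str     # database entry name (made-up values below)
    e_value: float      # smaller value means higher homology
    n_references: int   # number of references listed in the entry


def top_hits(hits, limit=50):
    """Sort by ascending E-value (most homologous first) and keep `limit` rows."""
    return sorted(hits, key=lambda h: h.e_value)[:limit]


hits = [
    Hit("ENTRY_B", 1e-5, 3),
    Hit("ENTRY_A", 1e-40, 12),
    Hit("ENTRY_C", 0.2, 1),
]
ranked = top_hits(hits, limit=2)
```

Here `ranked` holds the two entries with the smallest E-values, in the order in which they would be listed on the result screen.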
Now referring to
tf(d, t)=(frequency of appearance of keyword t in document d)
idf(t)=log(DBsize(db)/freq(t, db))+1
DBsize(db) is the total number of references included in the object document database, and freq(t, db) is the number of documents in the document database in which term t appears. The weight(d, t) of keyword t in document d is obtained by combining the two, i.e., weight(d, t) = tf(d, t) * idf(t). In this method of selecting keywords from references using tf·idf, the keywords with high weights are extracted from the references.
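The weighting defined above can be sketched in a few lines. The sketch assumes that each reference has already been split into a list of terms and that the logarithm is the natural logarithm; neither the tokenization nor the base of the logarithm is specified in this description, and the function name and sample terms are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return {doc_index: {term: weight}} with weight = tf * idf,
    where idf(t) = log(DBsize / freq(t)) + 1.

    Assumptions: each document is a list of terms; natural log is used.
    """
    db_size = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    weights = {}
    for i, doc in enumerate(documents):
        tf = Counter(doc)  # frequency of each term within this document
        weights[i] = {
            t: tf[t] * (math.log(db_size / doc_freq[t]) + 1) for t in tf
        }
    return weights


# Made-up term lists standing in for three references.
docs = [
    ["kinase", "kinase", "cancer"],
    ["kinase", "receptor"],
    ["receptor", "insulin"],
]
weights = tf_idf_weights(docs)
top_keyword_doc0 = max(weights[0], key=weights[0].get)
```

Selecting the terms with the largest weights in each reference then yields the keyword candidates for that reference.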
As shown in
In the KEYWORD RELATION NETWORK display screen, the nodes represented by white circles 71 indicate the keywords, and the lines (edges) 72 connecting the nodes indicate the relationships between the keywords. The color and/or thickness of the edges are varied depending on the number of co-occurrences. This viewer allows the user to recognize the relationships among the keywords easily.
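The edge values for such a network might be derived as in the following sketch. It assumes that the number of co-occurrences of two keywords is the number of references containing both, and that the keywords of each reference are available as a list; the function name and sample keywords are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_keywords):
    """Count, for each keyword pair, the number of documents containing both.

    Assumption: co-occurrence is counted once per document, regardless of
    how often each keyword appears within that document.
    """
    counts = Counter()
    for keywords in doc_keywords:
        # Sorting gives each unordered pair a single canonical key.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return counts


# Made-up keyword lists standing in for three references.
doc_keywords = [
    ["kinase", "cancer"],
    ["kinase", "cancer", "receptor"],
    ["receptor", "insulin"],
]
edges = cooccurrence_counts(doc_keywords)
```

Each key of `edges` corresponds to an edge between two keyword nodes, and its count could be mapped to the color or thickness of that edge.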
In the ONTOLOGY display screen, the keywords obtained from the references are sorted before being displayed, either so as to diagonalize the display as far as possible, or according to the setting of a slider bar giving a threshold for the E-value, the protein function name, the disease name, or the substance name, for example. In the illustrated example, the keywords are sorted by disease name on the vertical axis 73. On the horizontal axis 74, keywords such as the gene or protein names are arranged in decreasing order of importance (based on the E-value, for example). Clustering by disease name or the like can be conducted by utilizing an ONTOLOGY database, such as G-ONTOLOGY. The display is made such that, as shown in 75, keywords such as the gene or protein names are contained in the nodes and the co-occurrence or interaction is contained in the edges. The E-value may be reflected in the density of the displayed color of the nodes. Thus the relevant keywords can be presented according to disease or protein function in a more understandable manner using ONTOLOGY, facilitating the function prediction operation performed by biomedical experts.
Thus, the invention facilitates the discovery or prediction of the function of a query, such as a search object sequence or structure information, from vast amounts of references related to homologous sequences with known functions. The functions extracted from the references can be visualized by means of a viewer, thus facilitating the function prediction by biomedical experts. Whereas the prior art has been unable to provide sufficient prediction and has required costly and time-consuming experimentation, due to its inability to deal with existing knowledge in an exhaustive manner, higher levels of efficiency can be obtained by the present invention.
Number | Date | Country | Kind |
---|---|---|---|
2003-132846 | May 2003 | JP | national |