Claims
- 1. A computer-implemented method of identifying candidate genes from a plurality of DNA sequences, the method comprising:
obtaining results of a homology search for the plurality of DNA sequences, the homology search results comprising information about homologs of the plurality of DNA sequences; obtaining annotative information for the plurality of DNA sequences, the annotative information comprising information about the biochemical functions and physiological roles of the plurality of DNA sequences; obtaining gene expression profile data for the plurality of DNA sequences, the gene expression profile data describing behavioral patterns of the plurality of DNA sequences; clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data; storing the results of the homology search, the annotative information, the gene expression profile data, and results from clustering the plurality of DNA sequences in a database; receiving a query identifying criteria for the candidate genes; and searching the database, in response to the query, to identify a set of DNA sequences from the plurality of DNA sequences which satisfy the query criteria.
- 2. The method of claim 1 wherein the homology search for the plurality of DNA sequences comprises performing BLAST analysis, Smith-Waterman analysis, Hidden Markov Model (HMM) analysis, and EMotif analysis.
- 3. The method of claim 2 wherein performing the BLAST analysis, the Smith-Waterman analysis, the Hidden Markov Model (HMM) analysis, and the EMotif analysis comprises:
performing the BLAST analysis on the first plurality of DNA sequences using a first database of sequences; identifying a second plurality of DNA sequences from the first plurality of sequences which are not known based on the BLAST analysis using the first database of sequences; performing Smith-Waterman analysis on the second plurality of DNA sequences using a protein database and a translated patent database; identifying a third plurality of DNA sequences from the second plurality of sequences which are not known based on the Smith-Waterman analysis; performing Hidden Markov Model (HMM) analysis and EMotif analysis on the third plurality of DNA sequences using the protein database and GenBank database; and performing BLAST analysis on the third plurality of DNA sequences using GenBank EST database.
- 4. The method of claim 1 wherein obtaining the annotative information comprises:
identifying known genes from the first plurality of DNA sequences based on the homology search; and accessing information sources storing annotative information for the known genes; and extracting the annotative information from the information sources for the known genes.
- 5. The method of claim 4 wherein extracting the annotative information comprises:
assigning a reference score to the extracted annotative information based on the level of acceptance of the role or function of the known genes as described by the annotative information such that annotative information with a high level of acceptance is assigned a higher reference score than annotative information with a low level of acceptance.
- 6. The method of claim 4 wherein the information sources include GenBank database, SWISS-PROT database, Medline database, and biomedical publications.
- 7. The method of claim 4 wherein:
accessing the information sources comprises accessing biomedical publications; extracting the annotative information comprises:
for annotative information extracted from each biomedical publication:
assigning a reference score to the extracted annotative information based on characteristics of the biomedical publication, the reference score indicating the level of acceptance of the role or function of the known genes as described by the annotative information extracted from the biomedical publication; and storing the annotative information in the database comprises storing the reference score.
- 8. The method of claim 7 wherein assigning the reference score comprises:
using a score derived from a citation index database to calculate the reference score, the score derived from the citation index database indicating the number of times that the annotative information from the biomedical publication was referenced by other information sources.
- 9. The method of claim 7 wherein assigning the reference score further comprises:
ranking the biomedical publications; and assigning the reference score to the annotative information extracted from the biomedical publication based on the ranking of the biomedical publication.
- 10. The method of claim 1 wherein clustering the plurality of DNA sequences comprises determining relationships between clusters of DNA sequences from the plurality of DNA sequences.
- 11. The method of claim 1 wherein clustering the plurality of DNA sequences comprises clustering the plurality of DNA sequences based on time-course data described by the gene expression profile data.
- 12. The method of claim 1 wherein storing the information in the database comprises correlating the annotative information for the plurality of DNA sequences with the gene expression profile data for the plurality of DNA sequences.
- 13. A method of identifying candidate genes comprising:
configuring a query identifying criteria for the candidate genes; communicating the query to a server storing information related to a plurality of DNA sequences, the information comprising:
results of a homology search for the plurality of DNA sequences, the homology search results comprising information about homologs of the plurality of DNA sequences; information about the biochemical functions and physiological roles of the plurality of DNA sequences; information describing behavioral patterns of the plurality of DNA sequences; and results from clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data; and receiving from the server, in response to the query, a first set of DNA sequences from the plurality of DNA sequences, wherein the first set of DNA sequences satisfy the criteria for the candidate genes identified in the query.
- 14. A data processing system for identifying candidate genes from a plurality of DNA sequences, the system comprising:
a processor; and a memory coupled to the processor, the memory configured to store instructions for execution by the processor, the instructions comprising:
instructions for obtaining results of a homology search for the plurality of DNA sequences, the homology search results comprising information about homologs of the plurality of DNA sequences; instructions for obtaining annotative information for the plurality of DNA sequences, the annotative information comprising information about the biochemical functions and physiological roles of the plurality of DNA sequences; instructions for obtaining gene expression profile data for the plurality of DNA sequences, the gene expression profile data describing behavioral patterns of the plurality of DNA sequences; instructions for clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data; instructions for storing the results of the homology search, the annotative information, the gene expression profile data, and results from clustering the plurality of DNA sequences in the memory; and instructions for searching the information stored in the memory, in response to a query identifying criteria for the candidate genes, to identify a set of DNA sequences from the plurality of DNA sequences which satisfy the query criteria.
- 15. The system of claim 14 wherein the memory is further configured to store instructions for performing the homology search, the instructions comprising:
instructions for performing BLAST analysis on the first plurality of DNA sequences using a first database of sequences; instructions for identifying a second plurality of DNA sequences from the first plurality of sequences which are not known based on the BLAST analysis using the first database of sequences; instructions for performing Smith-Waterman analysis on the second plurality of DNA sequences using a protein database and a translated patent database; instructions for identifying a third plurality of DNA sequences from the second plurality of sequences which are not known based on the Smith-Waterman analysis; instructions for performing Hidden Markov Model (HMM) analysis and EMotif analysis on the third plurality of DNA sequences using the protein database and GenBank database; and instructions for performing BLAST analysis on the third plurality of DNA sequences using GenBank EST database.
- 16. The system of claim 14 wherein the instructions for obtaining the annotative information comprise:
instructions for identifying known genes from the first plurality of DNA sequences based on the homology search; and instructions for accessing information sources storing annotative information for the known genes; and instructions for extracting the annotative information from the information sources for the known genes.
- 17. The system of claim 16 wherein the instructions for extracting the annotative information comprise:
instructions for assigning a reference score to the extracted annotative information based on the level of acceptance of the role or function of the known genes as described by the annotative information such that annotative information with a high level of acceptance is assigned a higher reference score than annotative information with a low level of acceptance.
- 18. The system of claim 16 wherein the information sources include GenBank database, SWISS-PROT database, Medline database, and biomedical publications.
- 19. The system of claim 16 wherein:
the instructions for accessing the information sources comprise instructions for accessing biomedical publications; the instructions for extracting the annotative information comprise:
instructions for assigning a reference score to annotative information extracted from each biomedical publication based on characteristics of the biomedical publication, the reference score indicating the level of acceptance of the role or function of the known genes as described by the annotative information extracted from the biomedical publication; and the instructions for storing the annotative information in the memory comprise instructions for storing the reference score.
- 20. The system of claim 19 wherein the instructions for assigning the reference score comprise:
instructions for using a score derived from a citation index database to calculate the reference score, the score derived from the citation index database indicating the number of times that the annotative information from the biomedical publication was referenced by other information sources.
- 21. The system of claim 19 wherein the instructions for assigning the reference score comprise:
instructions for ranking the biomedical publications; and instructions for assigning the reference score to the annotative information extracted from the biomedical publication based on the ranking of the biomedical publication.
- 22. The system of claim 14 wherein the instructions for clustering the plurality of DNA sequences comprise instructions for determining relationships between clusters of DNA sequences from the plurality of DNA sequences.
- 23. The system of claim 14 wherein the instructions for clustering the plurality of DNA sequences comprise instructions for clustering the plurality of DNA sequences based on time-course data described by the gene expression profile data.
- 24. The system of claim 14 wherein the instructions for storing the information in the database comprise instructions for correlating the annotative information for the plurality of DNA sequences with the gene expression profile data for the plurality of DNA sequences.
- 25. A system for identifying candidate genes comprising:
a communication network; a first computer coupled to the communication network; and a second computer coupled to the communication network, the second computer configured to store:
results of a homology search for a plurality of DNA sequences, the homology search results comprising information about homologs of the plurality of DNA sequences; information about the biochemical functions and physiological roles of the plurality of DNA sequences; information describing behavioral patterns of the plurality of DNA sequences; and results from clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data; wherein the first computer is configured to communicate a query to the second computer, the query identifying criteria for the candidate genes; and wherein the first computer is configured to receive from the second computer, in response to the query, a first set of DNA sequences from the plurality of DNA sequences which satisfy the criteria for the candidate genes identified in the query.
- 26. A computer program product stored on a computer-readable storage medium for identifying candidate genes from a plurality of DNA sequences, the computer program product comprising:
code for obtaining results of a homology search for the plurality of DNA sequences, the homology search results comprising information about homologs of the plurality of DNA sequences; code for obtaining annotative information for the plurality of DNA sequences, the annotative information comprising information about the biochemical functions and physiological roles of the plurality of DNA sequences; code for obtaining gene expression profile data for the plurality of DNA sequences, the gene expression profile data describing behavioral patterns of the plurality of DNA sequences; code for clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data; code for storing the results of the homology search, the annotative information, the gene expression profile data, and results from clustering the plurality of DNA sequences in a database; code for receiving a query identifying criteria for the candidate genes; code for searching the database, in response to the query, to identify a set of DNA sequences from the plurality of DNA sequences which satisfy the query criteria.
- 27. A computer program product stored on a computer-readable storage medium for identifying candidate genes, the computer program product comprising:
code for configuring a query identifying criteria for the candidate genes; code for communicating the query to a server storing information related to a plurality of DNA sequences, the information comprising:
results of a homology search for the plurality of DNA sequences, the homology search results comprising information about homologs of the plurality of DNA sequences; information about the biochemical functions and physiological roles of the plurality of DNA sequences; information describing behavioral patterns of the plurality of DNA sequences; and results from clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data; and code for receiving from the server, in response to the query, a first set of DNA sequences from the plurality of DNA sequences, wherein the first set of DNA sequences satisfy the criteria for the candidate genes identified in the query.
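By way of illustration only, the tiered homology search recited in claims 3 and 15 may be sketched as follows. The sketch is non-limiting and rests on stated assumptions: the function name `tiered_homology_search`, the `Search` type, and the toy predicates in the usage example are hypothetical and do not appear in the claims, and no actual BLAST, Smith-Waterman, HMM, or EMotif tool is invoked. Each analysis is represented by a stand-in function that merely reports whether a sequence matches a known entry.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical stand-in: a "search" takes a sequence and returns True when
# the sequence matches a known entry. Real BLAST, Smith-Waterman, HMM, or
# EMotif tools are NOT called here.
Search = Callable[[str], bool]


def tiered_homology_search(
    sequences: Iterable[str],
    blast_first_db: Search,
    smith_waterman: Search,
    later_analyses: Dict[str, Search],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Return (known, third_tier_hits).

    known maps each identified sequence to the stage that identified it.
    third_tier_hits maps each sequence still unknown after the first two
    stages to the names of the later analyses that reported a hit.
    """
    known: Dict[str, str] = {}

    # Stage 1: BLAST on the first plurality; unknowns form the second plurality.
    second: List[str] = []
    for seq in sequences:
        if blast_first_db(seq):
            known[seq] = "BLAST"
        else:
            second.append(seq)

    # Stage 2: Smith-Waterman on the second plurality; unknowns form the third.
    third: List[str] = []
    for seq in second:
        if smith_waterman(seq):
            known[seq] = "Smith-Waterman"
        else:
            third.append(seq)

    # Stage 3: every remaining analysis (e.g. HMM, EMotif, BLAST against an
    # EST database) is run on the third plurality.
    third_tier_hits = {
        seq: [name for name, search in later_analyses.items() if search(seq)]
        for seq in third
    }
    return known, third_tier_hits


if __name__ == "__main__":
    # Toy data only; the sets below are assumptions standing in for databases.
    first_db = {"ATGGCC"}
    protein_db = {"ATGTTT"}
    known, third = tiered_homology_search(
        ["ATGGCC", "ATGTTT", "ATGAAA"],
        blast_first_db=lambda s: s in first_db,
        smith_waterman=lambda s: s in protein_db,
        later_analyses={
            "HMM": lambda s: s.endswith("AAA"),
            "EMotif": lambda s: False,
            "BLAST-EST": lambda s: s.startswith("ATG"),
        },
    )
    print(known)
    print(third)
```

In this sketch each stage narrows the set passed to the next, so the later analyses are applied only to sequences that the earlier stages could not identify.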
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application also claims priority from and is a continuation-in-part application of non-provisional U.S. patent application Ser. No. 09/365,587, entitled “SYSTEM AND METHOD FOR IDENTIFYING CRITICAL REGULATED GENES,” filed Jul. 30, 1999, the entire contents of which are incorporated herein by reference for all purposes.
Continuations (1)
| | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 09628202 | Jul 2000 | US |
| Child | 10229912 | Aug 2002 | US |
Continuation in Parts (1)
| | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 09365587 | Jul 1999 | US |
| Child | 09628202 | Jul 2000 | US |