Claims
- 1. A computer-implemented method for gene expression analysis comprising:
obtaining a unique set of annotation terms for a plurality of interested genes;
performing a cluster analysis to obtain clusters for the annotation terms; and
assigning interested genes into the clusters according to their annotation terms.
- 2. The method of claim 1 wherein the gene annotation terms are GO terms.
- 3. The method of claim 2 wherein the cluster analysis is based upon pairwise similarity measures between the GO terms.
- 4. The method of claim 3 wherein at least one interested gene is assigned to a plurality of clusters.
- 5. The method of claim 3 wherein the cluster analysis is performed with a clique-finding algorithm.
- 6. The method of claim 3 wherein the pairwise similarity measures are determined according to the GO digraph paths.
- 7. The method of claim 6 wherein each of the pairwise similarity measures is calculated based upon the length of the partial path shared by two annotation terms.
- 8. The method of claim 7 wherein a weighting factor is assigned to each edge as a function of the level in a path.
- 9. The method of claim 8 wherein the stringency of similarity scaling may be adjusted by adjusting the weighting factor.
- 10. The method of claim 7 wherein a greedy method is used to select the longest common partial path when an annotation term is in multiple paths.
- 11. A computer-implemented method for gene expression analysis comprising:
calculating Euclidean distances between a plurality of genes based upon gene expression profiling data;
combining the Euclidean distances with a gene annotation similarity matrix to generate a gene similarity matrix; and
performing a cluster analysis on the gene similarity matrix to assign genes into clusters.
- 12. The method of claim 11 wherein the gene annotation similarity matrix contains pairwise similarity measures between the GO terms.
- 13. The method of claim 12 wherein at least one interested gene is assigned to a plurality of clusters.
- 14. The method of claim 13 wherein the cluster analysis is performed with a clique-finding algorithm.
- 15. The method of claim 12 wherein the pairwise similarity measures are determined according to the GO digraph paths.
- 16. The method of claim 15 wherein each of the pairwise similarity measures is calculated based upon the length of the partial path shared by two annotation terms.
- 17. The method of claim 16 wherein a weighting factor is assigned to each edge as a function of the level in a path.
- 18. The method of claim 17 wherein the stringency of similarity scaling may be adjusted by adjusting the weighting factor.
- 19. The method of claim 18 wherein a greedy method is used to select the longest common partial path when an annotation term is in multiple paths.
- 20. The method of claim 11 wherein the Euclidean distances are converted to similarity scores by subtraction from 5.
- 21. The method of claim 20 wherein the combining comprises summing the Euclidean distances with the GO similarity matrix at a ratio to generate the gene similarity matrix.
- 22. The method of claim 21 wherein Fisher's exact test is used to rank each cluster.
- 23. The method of claim 22 wherein the highest ranked clusters are used for biological interpretation.
- 24. A computer readable medium having software modules for performing the method of:
obtaining a unique set of annotation terms for a plurality of interested genes;
performing a cluster analysis to obtain clusters for the annotation terms; and
assigning interested genes into the clusters according to their annotation terms.
- 25. The computer readable medium of claim 24 wherein the gene annotation terms are GO terms.
- 26. The computer readable medium of claim 25 wherein the cluster analysis is based upon pairwise similarity measures between the GO terms.
- 27. The computer readable medium of claim 26 wherein at least one interested gene is assigned to a plurality of clusters.
- 28. The computer readable medium of claim 26 wherein the cluster analysis is performed with a clique-finding algorithm.
- 29. The computer readable medium of claim 26 wherein the pairwise similarity measures are determined according to the GO digraph paths.
- 30. The computer readable medium of claim 29 wherein each of the pairwise similarity measures is calculated based upon the length of the partial path shared by two annotation terms.
- 31. The computer readable medium of claim 30 wherein a weighting factor is assigned to each edge as a function of the level in a path.
- 32. The computer readable medium of claim 31 wherein the stringency of similarity scaling may be adjusted by adjusting the weighting factor.
- 33. The computer readable medium of claim 29 wherein a greedy method is used to select the longest common partial path when an annotation term is in multiple paths.
- 34. A computer readable medium having software modules for performing the method of:
calculating Euclidean distances between a plurality of genes based upon gene expression profiling data;
combining the Euclidean distances with a gene annotation similarity matrix to generate a gene similarity matrix; and
performing a cluster analysis on the gene similarity matrix to assign genes into clusters.
- 35. The computer readable medium of claim 34 wherein the gene annotation similarity matrix contains pairwise similarity measures between the GO terms.
- 36. The computer readable medium of claim 35 wherein at least one interested gene is assigned to a plurality of clusters.
- 37. The computer readable medium of claim 36 wherein the cluster analysis is performed with a clique-finding algorithm.
- 38. The computer readable medium of claim 37 wherein the pairwise similarity measures are determined according to the GO digraph paths.
- 39. The computer readable medium of claim 38 wherein each of the pairwise similarity measures is calculated based upon the length of the partial path shared by two annotation terms.
- 40. The computer readable medium of claim 39 wherein a weighting factor is assigned to each edge as a function of the level in a path.
- 41. The computer readable medium of claim 40 wherein the stringency of similarity scaling may be adjusted by adjusting the weighting factor.
- 42. The computer readable medium of claim 40 wherein a greedy method is used to select the longest common partial path when an annotation term is in multiple paths.
- 43. The computer readable medium of claim 40 wherein the Euclidean distances are converted to similarity scores by subtraction from 5.
- 44. The computer readable medium of claim 43 wherein the combining comprises summing the Euclidean distances with the GO similarity matrix at a ratio to generate the gene similarity matrix.
- 45. The computer readable medium of claim 44 wherein Fisher's exact test is used to rank each cluster.
- 46. The computer readable medium of claim 45 wherein the highest ranked clusters are used for biological interpretation.
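By way of illustration only, the path-based similarity recited in claims 6 to 10 and the combining step recited in claims 11, 20 and 21 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. The function names, the toy term labels, the convention that each path lists terms from the root downward, the reading of "at a ratio" as a weighted sum, and the choice of the best-scoring pair of paths as the longest common partial path are all hypothetical choices made for the example.

```python
import math


def path_similarity(path_a, path_b, weight=lambda level: 1.0):
    """Weighted length of the partial path shared by two root-first paths.

    `weight(level)` is a hypothetical per-edge weighting factor, where
    `level` is the depth of the node the edge leads into (root = 0).
    """
    score = 0.0
    for level, (a, b) in enumerate(zip(path_a, path_b)):
        if a != b:
            break  # the shared partial path ends here
        if level > 0:
            # one shared edge connects level - 1 to level
            score += weight(level)
    return score


def term_similarity(paths_a, paths_b, weight=lambda level: 1.0):
    """When a term lies on several paths, keep the best-scoring pair.

    Assumption: taking the maximum over all path pairs stands in for
    selecting the longest common partial path.
    """
    return max(
        path_similarity(pa, pb, weight) for pa in paths_a for pb in paths_b
    )


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def combined_similarity(expr_a, expr_b, go_sim, ratio=0.5, offset=5.0):
    """Combine expression similarity with annotation similarity.

    The distance is turned into a similarity by subtraction from `offset`
    (5 in claim 20). Treating `ratio` as the weight of a weighted sum is
    an assumption; the claims do not specify how the ratio is applied.
    """
    expr_sim = offset - euclidean(expr_a, expr_b)
    return ratio * expr_sim + (1.0 - ratio) * go_sim


# Toy paths with made-up labels (not real GO identifiers).
lipid = ["root", "process", "metabolism", "lipid metabolism"]
protein = ["root", "process", "metabolism", "protein metabolism"]
transport = ["root", "process", "transport"]

go_sim = term_similarity([lipid, transport], [protein])
gene_sim = combined_similarity([0.0, 0.0], [3.0, 4.0], go_sim)
```

Passing a different `weight` function changes how strongly deeper shared edges count, which is one possible reading of the adjustable stringency of claim 9. A matrix of `combined_similarity` values over all gene pairs would then be the input to the cluster analysis.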
RELATED APPLICATIONS
[0001] This application claims the priority of U.S. Provisional Application Serial No. 60/297,210.
[0002] This application is related to U.S. patent application Ser. Nos. 10/026,110, 10/256,938 and ______, attorney docket 3539, titled “Statistical Analysis for Gene Ontology”, filed on Dec. 3, 2002 and U.S. patent application Ser. No. ______ Docket Number 3546, filed concurrently herewith. The cited applications are incorporated herein by reference.