Claims
- 1. A computer-implemented method for characterizing a plurality of biological sequences comprising:
obtaining a plurality of models, wherein each of the models represents a classification of biological sequences with structural or functional similarity; determining fitness of the biological sequences to the models; and automatically classifying the sequences according to the distances to the models.
- 2. The method of claim 1 wherein the plurality of biological sequences have at least 50 sequences.
- 3. The method of claim 2 wherein the plurality of biological sequences have at least 100 sequences.
- 4. The method of claim 3 wherein the plurality of biological sequences have at least 100 sequences.
- 5. The method of claim 3 wherein the models are hidden Markov models.
- 6. The method of claim 5 wherein the classification is a family and each model represents a family.
- 7. The method of claim 6 wherein the sequences are protein sequences.
- 8. The method of claim 7 wherein the distances are E-values.
- 9. The method of claim 8 wherein the step of automatically classifying comprises a step of determining a threshold for each of the models.
- 10. The method of claim 9 wherein the step of determining a threshold comprises performing a curve analysis.
- 11. The method of claim 10 wherein the step of performing a curve analysis comprises determining a point where the E-value curve drops abruptly or flattens.
- 12. A computer-implemented method for gene characterization comprising:
generating libraries of models using structural relationships of known proteins; inputting a plurality of protein sequences; comparing the plurality of protein sequences with the models; automatically establishing criteria for assigning the sequences for each model; and assigning the sequences to the models based upon the criteria.
- 13. The method of claim 12 wherein the models are hidden Markov models.
- 14. The method of claim 12 wherein at least 50 protein sequences are predicted protein sequences.
- 15. The method of claim 14 wherein at least 150 protein sequences are predicted protein sequences.
- 16. The method of claim 15 wherein at least 500 protein sequences are predicted protein sequences.
- 17. The method of claim 12 wherein the step of automatically establishing comprises determining a threshold for each of the models.
- 18. The method of claim 17 wherein the step of determining a threshold comprises performing a curve analysis.
- 19. The method of claim 18 wherein the step of performing a curve analysis comprises determining a point where the E-value curve drops abruptly or flattens.
- 20. A system for gene annotation comprising a processor; and a memory coupled with the processor, the memory storing a plurality of machine instructions that cause the processor to perform logical steps comprising obtaining a plurality of models, wherein each of the models represents a classification of biological sequences with structural or functional similarity; determining fitness of the biological sequences to the models; and automatically classifying the sequences according to the distances to the models.
- 21. The system of claim 20 wherein the plurality of biological sequences have at least 50 sequences.
- 22. The system of claim 21 wherein the plurality of biological sequences have at least 100 sequences.
- 23. The system of claim 22 wherein the plurality of biological sequences have at least 100 sequences.
- 24. The system of claim 23 wherein the models are hidden Markov models.
- 25. The system of claim 24 wherein the classification is a family and each model represents a family.
- 26. The system of claim 25 wherein the sequences are protein sequences.
- 27. The system of claim 26 wherein the distances are E-values.
- 28. The system of claim 27 wherein the step of automatically classifying comprises a step of determining a threshold for each of the models.
- 29. The system of claim 28 wherein the step of determining a threshold comprises performing a curve analysis.
- 30. The system of claim 29 wherein the step of performing a curve analysis comprises determining a point where the E-value curve drops abruptly or flattens.
- 31. A system for gene annotation comprising a processor; and a memory coupled with the processor, the memory storing a plurality of machine instructions that cause the processor to perform logical steps comprising
generating libraries of models using structural relationships of known proteins; inputting a plurality of protein sequences; comparing the plurality of protein sequences with the models; automatically establishing criteria for assigning the sequences for each model; and assigning the sequences to the models based upon the criteria.
- 32. The system of claim 31 wherein the models are hidden Markov models.
- 33. The system of claim 32 wherein at least 50 protein sequences are predicted protein sequences.
- 34. The system of claim 33 wherein at least 150 protein sequences are predicted protein sequences.
- 35. The system of claim 34 wherein at least 500 protein sequences are predicted protein sequences.
- 36. The system of claim 35 wherein the step of automatically establishing comprises determining a threshold for each of the models.
- 37. The system of claim 36 wherein the step of determining a threshold comprises performing a curve analysis.
- 38. The system of claim 37 wherein the step of performing a curve analysis comprises determining a point where the E-value curve drops abruptly or flattens.
- 39. A computer software product of the invention comprising a computer readable medium having computer-executable instructions for performing the method comprising:
obtaining a plurality of models, wherein each of the models represents a classification of biological sequences with structural or functional similarity; determining fitness of the biological sequences to the models; and automatically classifying the sequences according to the distances to the models.
- 40. The product of claim 39 wherein the plurality of biological sequences have at least 50 sequences.
- 41. The product of claim 40 wherein the plurality of biological sequences have at least 100 sequences.
- 42. The product of claim 41 wherein the plurality of biological sequences have at least 100 sequences.
- 43. The product of claim 42 wherein the models are hidden Markov models.
- 44. The product of claim 43 wherein the classification is a family and each model represents a family.
- 45. The product of claim 44 wherein the sequences are protein sequences.
- 46. The product of claim 45 wherein the distances are E-values.
- 47. The product of claim 46 wherein the step of automatically classifying comprises a step of determining a threshold for each of the models.
- 48. The product of claim 47 wherein the step of determining a threshold comprises performing a curve analysis.
- 49. The product of claim 48 wherein the step of performing a curve analysis comprises determining a point where the E-value curve drops abruptly or flattens.
- 50. A computer software product of the invention comprising a computer readable medium having computer-executable instructions for performing the method comprising:
generating libraries of models using structural relationships of known proteins; inputting a plurality of protein sequences; comparing the plurality of protein sequences with the models; automatically establishing criteria for assigning the sequences for each model; and assigning the sequences to the models based upon the criteria.
- 51. The product of claim 50 wherein the models are hidden Markov models.
- 52. The product of claim 51 wherein at least 50 protein sequences are predicted protein sequences.
- 53. The product of claim 52 wherein at least 150 protein sequences are predicted protein sequences.
- 54. The product of claim 53 wherein at least 500 protein sequences are predicted protein sequences.
- 55. The product of claim 54 wherein the step of automatically establishing comprises determining a threshold for each of the models.
- 56. The product of claim 55 wherein the step of determining a threshold comprises performing a curve analysis.
- 57. The product of claim 56 wherein the step of performing a curve analysis comprises determining a point where the E-value curve drops abruptly or flattens.
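
For illustration only, the per-model threshold determination and assignment recited in the claims above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: it assumes E-values for each model have already been computed by some HMM search tool, it covers only the "drops abruptly" case of the curve analysis (not the "flattens" case), and the names `curve_threshold`, `classify`, and the `min_drop` parameter are hypothetical.

```python
import math
from typing import Dict, List

E_FLOOR = 1e-300  # assumed floor so an E-value of 0 does not break log10


def curve_threshold(evalues: List[float], min_drop: float = 3.0) -> float:
    """Pick an E-value cutoff for one model from its sorted E-value curve.

    Hits are ordered from best (smallest E-value) to worst and converted to
    -log10(E) scores. The cutoff is placed just before the largest drop in
    score between consecutive hits, provided that drop is at least
    `min_drop` orders of magnitude. If no such drop exists, the worst
    E-value is returned, so every hit passes (an assumption of this sketch).
    """
    if not evalues:
        raise ValueError("at least one E-value is required")
    ordered = sorted(evalues)
    scores = [-math.log10(max(e, E_FLOOR)) for e in ordered]

    best_gap = 0.0
    cut_index = len(ordered) - 1  # default: keep every hit
    for i in range(len(scores) - 1):
        gap = scores[i] - scores[i + 1]
        if gap >= min_drop and gap > best_gap:
            best_gap, cut_index = gap, i
    return ordered[cut_index]


def classify(
    hits: Dict[str, Dict[str, float]], min_drop: float = 3.0
) -> Dict[str, List[str]]:
    """Assign sequences to models using a separate threshold per model.

    `hits` maps a model name to {sequence id: E-value}.
    """
    assignments: Dict[str, List[str]] = {}
    for model, per_sequence in hits.items():
        cutoff = curve_threshold(list(per_sequence.values()), min_drop)
        assignments[model] = sorted(
            seq_id for seq_id, e in per_sequence.items() if e <= cutoff
        )
    return assignments


# Hypothetical E-values for one family model; not data from the application.
example_hits = {
    "family_A": {"p1": 1e-50, "p2": 1e-45, "p3": 1e-40, "p4": 1e-3, "p5": 0.5},
}
example_assignments = classify(example_hits)
```

In this hypothetical input the curve falls by about 37 orders of magnitude between `p3` and `p4`, so the sketch places the cutoff at `p3` and assigns `p1`, `p2`, and `p3` to `family_A`. Working in -log10 space is a design choice of the sketch: it makes an "abrupt drop" a simple difference between neighbouring hits.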
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] Referenced-Applications
[0002] This application claims priority to U.S. Provisional Application No. 60/285,144, filed on Apr. 19, 2001, and U.S. Provisional Application No. 60/285,403, filed on Apr. 20, 2001. Both provisional applications are incorporated herein by reference for all purposes.
Provisional Applications (2)

| Number | Date | Country |
| --- | --- | --- |
| 60285144 | Apr 2001 | US |
| 60285403 | Apr 2001 | US |