Claims
- 1. A method for predicting a polyadenylation site comprising:
inputting a plurality of RNA transcript sequences or sequences derived from RNA transcript sequences, wherein at least one sequence has its poly A or poly T tract sequence; searching for a polyadenylation site, wherein the polyadenylation site is an adenine rich region at the end of the sequence or a thymine rich region at the beginning of the sequence; detecting the presence of polyadenylation signals neighboring the polyadenylation site by scanning the EST or RNA sequences or their corresponding genomic DNA sequences.
- 2. The method of claim 1 wherein the step of searching for a polyadenylation site comprises scanning the sequences for an adenine rich region at the end of the sequence or a thymine rich region at the beginning of the sequence.
- 3. The method of claim 2 wherein the adenine rich region comprises adenine in at least 50% of the region and the thymine rich region comprises thymine in at least 50% of the region.
- 4. The method of claim 2 wherein the adenine rich region comprises adenine in at least 60% of the region and the thymine rich region comprises thymine in at least 60% of the region.
- 5. The method of claim 2 wherein the adenine rich region comprises adenine in at least 70% of the region and the thymine rich region comprises thymine in at least 70% of the region.
- 6. The method of claim 2 wherein the adenine rich region comprises adenine in at least 80% of the region and the thymine rich region comprises thymine in at least 80% of the region.
- 7. The method of claim 1 wherein a heuristic score nA/(nA+0.5*(max(nR−20,0))) is used for detecting an adenine or thymine rich region; wherein nA is the number of adenines or thymines in the block, and nR is the number of bases after the block of adenines or thymines to the end of the sequence.
- 8. A method for detecting a polyadenylation signal in a sequence with a polyadenylation site comprising searching for a polyadenylation signal hexamer in the sequence before the polyadenylation site.
- 9. The method of claim 8 wherein the searching comprises evaluating the probability that there is a polyadenylation site: Pr(h=k|x) for k=6,7, . . . ,N, wherein the sequence before the polyadenylation site is x=(x1,x2, . . . xN) and where xN is the 3′-most base before the polyadenylation site.
- 10. The method of claim 9 wherein: Pr(h=k|x)=Pr(x|h=k) Pr(h=k)/Pr(x).
- 11. The method of claim 10 wherein Pr(h=k|x)=Pr(xk−5, . . . ,xk|h=k) Pr(h=k)/Pr(xk−5, . . . ,xk) and wherein Pr(h=k) is the probability that the polyadenylation hexamer is located at position k in the sequence, at a distance (N−k) from the polyadenylation site, Pr(xk−5, . . . ,xk|h=k) is the probability of observing the hexamer (xk−5, . . . ,xk) given that it is a polyadenylation signal and Pr(xk−5, . . . ,xk|h≠k) is the probability of observing the hexamer given that it is not from a polyadenylation signal.
- 12. The method of claim 11 wherein the step of detecting comprises using a gamma function to produce a density which places the majority of its weight on the positions located 5 to 25 bases distant from the polyadenylation site.
- 13. The method of claim 12 wherein Pr(xk−5, . . . ,xk|h≠k), the probability of observing the hexamer given that it is not from a polyadenylation signal, is modeled using a second-order Markov model trained on data collected from human 3′ UTRs.
- 14. The method of claim 13 wherein Pr(xk−5, . . . ,xk|h≠k)=Pr(xk−5) Pr(xk−4|xk−5) Pr(xk−3|xk−5,xk−4) Pr(xk−2|xk−4,xk−3) Pr(xk−1|xk−3,xk−2) Pr(xk|xk−2,xk−1), wherein the first term is a zero-order Markovian probability, the second is a first-order Markovian probability and the remaining four terms are second-order Markovian probabilities.
- 15. The method of claim 14 wherein, for a kth-order Markov model, the probability of base b following a word w of length k is estimated by the frequency of the concatenated word (wb) divided by the frequency of the word w, where frequencies are computed from the training dataset of 3′UTR sequences.
- 16. The method of claim 15 wherein, for the case k=0 (a zero-order Markovian model), the probability of base b is estimated by its frequency in the dataset divided by the size of the dataset.
- 17. A computer readable medium comprising computer-executable instructions for performing the method comprising:
inputting a plurality of RNA transcript sequences or sequences derived from RNA transcript sequences, wherein at least one sequence has its poly A or poly T tract sequence; searching for a polyadenylation site, wherein the polyadenylation site is an adenine rich region at the end of the sequence or a thymine rich region at the beginning of the sequence; detecting the presence of polyadenylation signals neighboring the polyadenylation site by scanning the EST or RNA sequences or their corresponding genomic DNA sequences.
- 18. The computer readable medium of claim 17 wherein the step of searching for a polyadenylation site comprises scanning the sequences for an adenine rich region at the end of the sequence or a thymine rich region at the beginning of the sequence.
- 19. The computer readable medium of claim 18 wherein the adenine rich region comprises adenine in at least 50% of the region and the thymine rich region comprises thymine in at least 50% of the region.
- 20. The computer readable medium of claim 19 wherein the adenine rich region comprises adenine in at least 60% of the region and the thymine rich region comprises thymine in at least 60% of the region.
- 21. The computer readable medium of claim 20 wherein the adenine rich region comprises adenine in at least 70% of the region and the thymine rich region comprises thymine in at least 70% of the region.
- 22. The computer readable medium of claim 21 wherein the adenine rich region comprises adenine in at least 80% of the region and the thymine rich region comprises thymine in at least 80% of the region.
- 23. The computer readable medium of claim 17 wherein a heuristic score nA/(nA+0.5*(max(nR−20,0))) is used for detecting an adenine or thymine rich region; wherein nA is the number of adenines or thymines in the block, and nR is the number of bases after the block of adenines or thymines to the end of the sequence.
- 24. A computer readable medium comprising computer-executable instructions for performing the method comprising: searching for a polyadenylation signal hexamer in the sequence before the polyadenylation site.
- 25. The computer readable medium of claim 24 wherein the searching comprises evaluating the probability that there is a polyadenylation site: Pr(h=k|x) for k=6,7, . . . ,N, wherein the sequence before the polyadenylation site is x=(x1,x2, . . . xN) and where xN is the 3′-most base before the polyadenylation site.
- 26. The computer readable medium of claim 25 wherein: Pr(h=k|x)=Pr(x|h=k) Pr(h=k)/Pr(x).
- 27. The computer readable medium of claim 26 wherein: Pr(h=k|x)=Pr(xk−5, . . . ,xk|h=k) Pr(h=k)/Pr(xk−5, . . . ,xk) and wherein Pr(h=k) is the probability that the polyadenylation hexamer is located at position k in the sequence, at a distance (N−k) from the polyadenylation site, Pr(xk−5, . . . ,xk|h=k) is the probability of observing the hexamer (xk−5, . . . ,xk) given that it is a polyadenylation signal and Pr(xk−5, . . . ,xk|h≠k) is the probability of observing the hexamer given that it is not from a polyadenylation signal.
- 28. The computer readable medium of claim 27 wherein the step of detecting comprises using a gamma function to produce a density which places the majority of its weight on the positions located 5 to 25 bases distant from the polyadenylation site.
- 29. The computer readable medium of claim 28 wherein Pr(xk−5, . . . ,xk|h≠k), the probability of observing the hexamer given that it is not from a polyadenylation signal, is modeled using a second-order Markov model trained on data collected from human 3′ UTRs.
- 30. The computer readable medium of claim 29 wherein Pr(xk−5, . . . ,xk|h≠k)=Pr(xk−5) Pr(xk−4|xk−5) Pr(xk−3|xk−5,xk−4) Pr(xk−2|xk−4,xk−3) Pr(xk−1|xk−3,xk−2) Pr(xk|xk−2,xk−1), wherein the first term is a zero-order Markovian probability, the second is a first-order Markovian probability and the remaining four terms are second-order Markovian probabilities.
- 31. The computer readable medium of claim 30 wherein, for a kth-order Markov model, the probability of base b following a word w of length k is estimated by the frequency of the concatenated word (wb) divided by the frequency of the word w, where frequencies are computed from the training dataset of 3′UTR sequences.
- 32. The computer readable medium of claim 31 wherein, for the case k=0 (a zero-order Markovian model), the probability of base b is estimated by its frequency in the dataset divided by the size of the dataset.
- 33. A system comprising: a processor; and a memory coupled with the processor, the memory storing a plurality of machine instructions that cause the processor to perform logical steps of the method comprising:
inputting a plurality of RNA transcript sequences or sequences derived from RNA transcript sequences, wherein at least one sequence has its poly A or poly T tract sequence; searching for a polyadenylation site, wherein the polyadenylation site is an adenine rich region at the end of the sequence or a thymine rich region at the beginning of the sequence; detecting the presence of polyadenylation signals neighboring the polyadenylation site by scanning the EST or RNA sequences or their corresponding genomic DNA sequences.
- 34. The system of claim 33 wherein the step of searching for a polyadenylation site comprises scanning the sequences for an adenine rich region at the end of the sequence or a thymine rich region at the beginning of the sequence.
- 35. The system of claim 34 wherein the adenine rich region comprises adenine in at least 50% of the region and the thymine rich region comprises thymine in at least 50% of the region.
- 36. The system of claim 35 wherein the adenine rich region comprises adenine in at least 60% of the region and the thymine rich region comprises thymine in at least 60% of the region.
- 37. The system of claim 36 wherein the adenine rich region comprises adenine in at least 70% of the region and the thymine rich region comprises thymine in at least 70% of the region.
- 38. The system of claim 37 wherein the adenine rich region comprises adenine in at least 80% of the region and the thymine rich region comprises thymine in at least 80% of the region.
- 39. The system of claim 33 wherein a heuristic score nA/(nA+0.5*(max(nR−20,0))) is used for detecting an adenine or thymine rich region; wherein nA is the number of adenines or thymines in the block, and nR is the number of bases after the block of adenines or thymines to the end of the sequence.
- 40. A system comprising: a processor; and a memory coupled with the processor, the memory storing a plurality of machine instructions that cause the processor to perform logical steps of the method for detecting a polyadenylation signal in a sequence with a polyadenylation site comprising: searching for a polyadenylation signal hexamer in the sequence before the polyadenylation site.
- 41. The system of claim 40 wherein the searching comprises evaluating the probability that there is a polyadenylation site: Pr(h=k|x) for k=6,7, . . . ,N, wherein the sequence before the polyadenylation site is x=(x1,x2, . . . xN) and where xN is the 3′-most base before the polyadenylation site.
- 42. The system of claim 41 wherein: Pr(h=k|x)=Pr(x|h=k) Pr(h=k)/Pr(x).
- 43. The system of claim 42 wherein Pr(h=k|x)=Pr(xk−5, . . . ,xk|h=k) Pr(h=k)/Pr(xk−5, . . . ,xk) and wherein Pr(h=k) is the probability that the polyadenylation hexamer is located at position k in the sequence, at a distance (N−k) from the polyadenylation site, Pr(xk−5, . . . ,xk|h=k) is the probability of observing the hexamer (xk−5, . . . ,xk) given that it is a polyadenylation signal and Pr(xk−5, . . . ,xk|h≠k) is the probability of observing the hexamer given that it is not from a polyadenylation signal.
- 44. The system of claim 43 wherein the step of detecting comprises using a gamma function to produce a density which places the majority of its weight on the positions located 5 to 25 bases distant from the polyadenylation site.
- 45. The system of claim 44 wherein Pr(xk−5, . . . ,xk|h≠k), the probability of observing the hexamer given that it is not from a polyadenylation signal, is modeled using a second-order Markov model trained on data collected from human 3′ UTRs.
- 46. The system of claim 45 wherein Pr(xk−5, . . . ,xk|h≠k)=Pr(xk−5) Pr(xk−4|xk−5) Pr(xk−3|xk−5,xk−4) Pr(xk−2|xk−4,xk−3) Pr(xk−1|xk−3,xk−2) Pr(xk|xk−2,xk−1), wherein the first term is a zero-order Markovian probability, the second is a first-order Markovian probability and the remaining four terms are second-order Markovian probabilities.
- 47. The system of claim 46 wherein, for a kth-order Markov model, the probability of base b following a word w of length k is estimated by the frequency of the concatenated word (wb) divided by the frequency of the word w, where frequencies are computed from the training dataset of 3′UTR sequences.
- 48. The system of claim 47 wherein, for the case k=0 (a zero-order Markovian model), the probability of base b is estimated by its frequency in the dataset divided by the size of the dataset.
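The scoring and probability steps recited in claims 7 and 9 to 16 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation. The function names, the gamma shape and scale (4.0 and 4.0), the use of the gamma density value at the integer distance N−k as Pr(h=k), and the expansion of Pr(xk−5, . . . ,xk) by total probability over h=k and h≠k are all assumptions that do not appear in the claims. The signal hexamer probabilities passed to `posterior` are placeholders to be supplied by the user.

```python
import math
from collections import Counter


def tail_score(n_a, n_r):
    """Heuristic score nA / (nA + 0.5 * max(nR - 20, 0)) from claim 7."""
    if n_a == 0:
        return 0.0  # assumption: an empty block scores zero (avoids 0/0)
    return n_a / (n_a + 0.5 * max(n_r - 20, 0))


def train_background(utr_seqs):
    """Count every word of length 1, 2 and 3 in the training sequences."""
    counts = Counter()
    for seq in utr_seqs:
        for size in (1, 2, 3):
            for i in range(len(seq) - size + 1):
                counts[seq[i:i + size]] += 1
    total = sum(len(seq) for seq in utr_seqs)  # size of the dataset in bases
    return counts, total


def cond_prob(counts, total, word, base):
    """Frequency of (word + base) divided by the frequency of word.

    For the empty word (zero-order case) this is the base frequency
    divided by the size of the dataset.
    """
    if not word:
        return counts[base] / total
    return counts[word + base] / counts[word] if counts[word] else 0.0


def background_prob(counts, total, hexamer):
    """Pr(hexamer | h != k): zero-, first-, then four second-order terms."""
    p = cond_prob(counts, total, "", hexamer[0])
    p *= cond_prob(counts, total, hexamer[0], hexamer[1])
    for i in range(2, 6):
        p *= cond_prob(counts, total, hexamer[i - 2:i], hexamer[i])
    return p


def gamma_prior(distance, shape=4.0, scale=4.0):
    """Gamma density at the given distance; shape and scale are assumed."""
    if distance <= 0:
        return 0.0
    norm = math.gamma(shape) * scale ** shape
    return distance ** (shape - 1) * math.exp(-distance / scale) / norm


def posterior(x, k, signal_probs, counts, total):
    """Pr(h=k | x) for the hexamer ending at 1-based position k of x."""
    hexamer = x[k - 6:k]
    prior = gamma_prior(len(x) - k)          # distance N - k to the site
    p_sig = signal_probs.get(hexamer, 0.0)   # Pr(hexamer | h = k), placeholder
    p_bg = background_prob(counts, total, hexamer)
    # Assumption: Pr(hexamer) expanded by total probability.
    evidence = p_sig * prior + p_bg * (1.0 - prior)
    return p_sig * prior / evidence if evidence > 0 else 0.0
```

A caller would train the background once with `train_background`, then evaluate `posterior` for k=6,7, . . . ,N on the sequence preceding a candidate site and inspect the positions with the highest values. With the assumed shape and scale, most of the prior weight falls on distances of 5 to 25 bases.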
RELATED APPLICATIONS
[0001] This application is related to U.S. patent application Ser. No. 09/721,042, filed on Nov. 21, 2000, entitled “Methods and Computer Software Products for Predicting Nucleic Acid Hybridization Affinity”; U.S. patent application Ser. No. 09/718,295, filed on Nov. 21, 2000, entitled “Methods and Computer Software Products for Selecting Nucleic Acid Probes”; U.S. patent application Ser. No. 09/745,965, filed on Dec. 21, 2000, entitled “Methods For Selecting Nucleic Acid Probes”; U.S. patent application Ser. No. 10/006,174, filed on Dec. 4, 2001; U.S. patent application Ser. No. ______, Attorney Docket No. 3440, filed on Dec. 21, 2001; and U.S. patent application Ser. No. ______, Attorney Docket No. 3441, filed on Dec. 21, 2001. All the cited applications are incorporated herein by reference in their entireties for all purposes.