Claims
- 1. A method of analyzing a nucleic acid sequence comprising:
constructing a CFD, thereby analyzing a nucleic acid sequence.
- 2. A method of identifying a CFD component associated with a property of a nucleic acid sequence or a peptide encoded by the nucleic acid, comprising:
optionally, providing CFDs for a training set of nucleic acid sequences; identifying one or more components of the CFDs; identifying a component, the presence, value, or contribution of which, is correlated, negatively or positively, with a property of the nucleic acid or the peptide encoded by a nucleic acid, thereby identifying a CFD component associated with a property of a nucleic acid sequence or a peptide encoded by the nucleic acid.
- 3. A method of analyzing a nucleic acid sequence, comprising:
providing a CFD for the nucleic acid sequence; identifying one or more components of the CFD; determining if a preselected component, known to be associated with a property of the nucleic acid sequence or a peptide encoded by the nucleic acid, is present, thereby analyzing the nucleic acid sequence.
- 4. A method of comparing nucleic acid sequences, comprising:
representing a nucleic acid sequence by a mathematical function of the entire sequence context, that depends on the collective characteristics or attributes of sequence type, order and composition, (a CFD); and comparing CFD's of two or more different, but perfectly matched, duplex sequences by providing a quantitative measurement of similarity between their CFDs.
- 5. The method of claim 4, wherein the method further includes comparing the CFD(s) of one (or more) hybrid duplexes comprised of two strands, whose sequences are not perfectly complementary, with the CFD(s) of the perfect duplexes comprised of one of each strand of the hybrid duplex and its perfect complementary strand.
- 6. The method of claim 5, wherein the method further includes the following steps:
calculating the CFD's for all duplexes under consideration; recording the CFD for each pair of strands in each perfect duplex under consideration.
- 7. The method of claim 5, wherein the quantitative similarity of the shapes of the reference CFD's and CFD's constructed for pairs of strands from different perfect duplexes provides a quantitative indication of the propensity for cross hybridization of the imperfect matched strands, which is useful where various pairs of strands are simultaneously present in a solution as is the case in a multiplex environment.
- 8. The method of claim 5, wherein the method further includes predicting both the transition temperature and cross-hybridization of duplex sequences from the CFD, and includes the following steps:
providing a set of duplex DNA molecules; providing the melting temperature of each duplex; measuring the cross-hybridization behavior of the set of duplexes; calculating the CFD's for the perfect duplex molecules of the set and of all the other combinations of strands and recording them, to provide a training set for an artificial intelligence algorithm; simplifying the CFD input by finding the basis CFD's for the set which are the minimal number of CFD's that can be combined to produce the entire set of CFD's; relating the coefficients of each sequence with the observed transition temperature and cross-hybridization propensity; and predicting the transition temperature and cross hybridization propensity for any new sequence from the coefficients of the basis CFD's for that sequence.
- 9. The method of claim 5, wherein the method is applied to predict the shape of the CFD from the desired transition temperature and cross hybridization propensity, comprising the following steps:
providing a set of duplex DNA molecules; providing the melting temperature of each duplex; determining the cross-hybridization behavior of the set of duplexes; calculating the CFD's for the perfect duplex molecules of the set and of all the other combinations of strands and recording them to provide a training set for an artificial intelligence algorithm; simplifying the CFD input by finding the basis CFD's for the set, which are the minimal number of CFD's that can be combined to produce the entire set of CFD's (for example, if three basis CFD's are found, then the shape of the CFD for each pair of sequences can be represented by three numbers (coefficients) instead of an entire CFD); training a neural network or using regression analysis to relate the observed transition temperature and cross-hybridization propensity with the coefficients representative of the CFD of each sequence; optimizing the neural network or regression by interactive adjustment using algorithms; calculating the predicted CFD from the desired transition temperature and cross hybridization propensity; feeding the desired Tm and cross-hybridization propensity into the trained network which provides the coefficients of the CFD; and calculating the corresponding CFD for the sequences with the desired Tm and cross-hybridization propensity.
- 11. The method of claim 5, wherein the method is applied to scanning of a nucleic acid, e.g., a gene or genome, and finding sequences with most similar and dissimilar segments and includes the following steps:
for analysis of a gene sequence (one strand) define the desired length, N, for a probe (primer or marker) to be compared to the gene sequence; starting at the first base of the genome, calculate the CFD for the N base pair duplex from position 1 to position N, continuing the process moving over every N base pair sequence until the last N base pair duplex of the genome is considered; and calculating the correlation coefficients for all combinations of perfect match duplex CFD's, recording the results as elements, rij, of a correlation matrix.
- 12. The method of claim 5, wherein the method determines the cross-hybridization propensity for a set of probes, e.g., all probes of a genome or a selected subset of dissimilar probes using a predefined threshold value of rij including the following steps:
provide all possible combinations of probe strands in duplexes; provide the CFD's of all possible combinations; after aligning each pair of CFD's at their minima, calculate the correlation coefficients of each pair of CFD's and assemble the correlation matrix.
- 13. The method of claim 5, wherein the method is used to scan a nucleic acid, e.g., a gene or genome sequence, for optimal regions for micro-array applications comprising the following steps:
define the Tm at which the micro array will be operated; define the desired threshold for cross hybridization propensity; define the length of the probes for the microarray; using a trained neural network predict the coefficients of the basis CFD's from the desired Tm and cross-hybridization propensity; use the basis CFD's and coefficients to generate the predicted CFD matching the desired Tm and cross-hybridization propensity; examine all sequences of the desired length and provide their CFD's; determine quantitative similarity between calculated and predicted CFD's; label each position by its corresponding correlation coefficient; define a threshold of similarity by the value of the correlation coefficient, for example rij>0.7, thereby providing sections of the gene above this threshold and having the desired Tm and cross-hybridization propensity.
- 14. The method of claim 5, wherein the method is used to design and generate probe sequences for use in a universal sequence microarray comprising the following steps:
(a) generating an Eulerian graph, describing a plurality of nucleic acid sequences; (b) partitioning the nucleic acid sequences according to a given composition; (c) creating subgraphs that specify how many and what type of the monomeric basis comprise the sequences wherein the subgraphs have vertices that correspond to the types of oligomeric sequences and edges that correspond to partitioning of the integers that describe properties of the sequences; (d) characterizing the sequences by their propensity for cross-hybridization by (i) formulating the context functional descriptor of each sequence aligned with itself as a nucleic acid duplex at each alignment position and (ii) assigning a number representing the relative thermodynamic stability of the duplex, thereby generating diagonal elements of a correlation matrix; and (e) aligning the deepest minima of off-diagonal elements of the correlation matrix with the deepest minima of the diagonal elements of the correlation matrix, thereby analyzing the potential interactions between the nucleic acid sequences.
- 15. The method of claim 5, wherein the method analyzes the potential interactions between nucleic acid sequences, e.g., sequences described herein, wherein the subgraphs generated in step (c) are listed in a relative manner according to a desired property.
- 16. A method for analyzing a population of nucleic acid sequences comprising:
providing a population of nucleic acid sequences; providing a CFD for each nucleic acid sequence and each nucleic sequence of a selected group of complements of the nucleic acids of the population; comparing the CFD for each nucleic acid sequence and its perfect complement with each of the CFD's for the same nucleic acid and each nucleic sequence of a selected group of complements of the nucleic acids of the population; thereby analyzing a population of nucleic acid sequences, e.g., for selecting a subset of the population having a selected degree of cross-hybridization or non cross-hybridization.
- 17. The method of claim 16, wherein the calculation of CFD includes accounting for loop structures inferred from mismatches.
- 18. The method of claim 16, wherein the parameter can include one or more of a thermodynamic value.
- 19. The method of claim 16, wherein the comparing step can include aligning the CFD data by a selected characteristic of a curve of values from the CFD.
- 20. The method of claim 16, wherein the comparison can include calculating a matrix of n sequences, wherein the matrix is a, b, c×a′, b′, c′, and the values in the matrix represent the CFD for a given duplex.
- 21. A method of providing a population of nucleic acid sequences comprising:
a) providing a value for the length of a nucleic acid; b) providing values for the base composition; c) providing a Eulerian representation of possible sequences, which representation can be described by a Eulerian graph; d) extracting sequences from the representation, to thereby provide a population of nucleic acid sequences.
- 22. The method of claim 21, wherein the Eulerian representation can be an n×n matrix, wherein n is equal to the number of bases used.
- 23. The method of claim 21 wherein extracting the sequence can include decomposing the Eulerian representation into components and permuting the components to produce the population of sequences.
- 24. A method of providing a population of nucleic acid sequences comprising:
a) providing a value for the length of a nucleic acid; b) providing values for the base composition; c) providing a representation, sometimes referred to herein as a Eulerian representation, of possible sequences which representation can be described by a Eulerian graph; d) repeating steps a, b, and c, at least one time; e) extracting sequences from the representations, to thereby provide a population of nucleic acid sequences.
- 25. The method of claim 24, wherein the representation can be an n×n matrix, wherein n is equal to the number of bases used.
- 26. A method for analyzing nucleic acid sequences comprising the steps of:
(a) generating an Eulerian graph, or representation thereof, describing a plurality of nucleic acid sequences; (b) optionally, partitioning the nucleic acid sequences according to a given composition; (c) creating subgraphs that specify how many and what type of the monomeric basis comprise the sequences wherein the subgraphs have vertices that correspond to the types of oligomeric sequences and edges that correspond to partitioning of the integers that describe properties of the sequences; (d) characterizing the sequences by their propensity for cross-hybridization by (i) formulating the context functional descriptor of each sequence aligned with itself as a nucleic acid duplex at each alignment position and (ii) assigning a number representing the relative thermodynamic stability of the duplex, thereby generating diagonal elements of a correlation matrix; (e) characterizing the sequences by their propensity for hybridization by (i) formulating the context functional descriptor of each sequence aligned with every other sequence as a nucleic acid duplex at each alignment position and (ii) assigning a number representing the relative thermodynamic stability of the duplex, thereby generating off-diagonal elements of the correlation matrix; and (f) aligning the deepest minima of off-diagonal elements of the correlation matrix with the deepest minima of the diagonal elements of the correlation matrix, thereby analyzing the potential interactions between the nucleic acid sequences.
- 27. A method of identifying a population of sequences comprising:
providing an initial population of nucleic acid sequences, e.g., cDNA's; providing, for a first nucleic acid sequence of the population, a selected set of oligomers derived from the first nucleic acid; providing, for a second and optionally subsequent nucleic acid sequence of the population, a selected set of oligomers derived from the second or subsequent nucleic acid; providing a Tm, for oligomers produced above and its perfect complement; selecting subpopulations of the oligomers for which a Tm is provided into a plurality of subpopulations each having a preselected range of values for Tm, thus providing a subpopulation which has a selected property.
- 28. A method for analyzing a nucleic acid sequence, to determine the ΔTm involved with introducing a change comprising:
providing a nucleic acid sequence A and providing a first CFD for the perfect duplex, A, A′; providing a nucleic acid sequence B′ which is the complement of B and where B differs from A by a change; providing a second CFD for the imperfect duplex, A, B′; comparing the first and second CFD's, providing a correlation coefficient; providing a value for Tm, for the perfect duplex A, A′; determining a value for the parameter for the imperfect duplex A, B′ by dividing the Tm of A, B′ by the correlation coefficient, thereby analyzing a nucleic acid sequence.
- 29. The method of claim 28, wherein the change is a change at a single nucleotide giving a single nucleotide mismatch.
- 30. A computer readable file, having a record which includes an element which identifies a nucleic acid, and an element which describes the CFD or one or more components thereof.
- 31. The file of claim 30, wherein the record includes an element which identifies a property of the nucleic acid or the peptide it encodes.
- 32. The file of claim 30, wherein the file includes records for a plurality of nucleic acids.
- 33. A method of analyzing a nucleic acid sequence comprising: providing a Eulerian representation of a population of sequences, wherein the population includes at least 10^5 sequences; searching the population for a sequence of interest or comparing a reference sequence with a sequence in the population.
- 34. A set of nucleic acids, made or compiled by a method described herein.
- 35. The set of nucleic acids of claim 34, wherein it is an ordered array.
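The scanning steps recited in claim 11 can be illustrated with a minimal sketch. The claims do not give a formula for the CFD, so `toy_cfd` below is a hypothetical stand-in (a smoothed profile built from hydrogen-bond counts per base pair), and the names `scan`, `pearson`, and `correlation_matrix` are likewise assumptions introduced only for illustration. The sketch slides a window of length N along one strand, computes a profile for each window, and assembles the matrix of correlation coefficients rij. The alignment of profiles at their minima recited in claim 12 is not attempted here.

```python
from math import sqrt

# Hydrogen bonds per Watson-Crick base pair; used only as a toy stability proxy.
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def toy_cfd(window, span=3):
    """Hypothetical stand-in for a CFD: a smoothed, negated stability profile.

    Each position gets minus the mean hydrogen-bond count of the bases within
    `span` positions of it, so more stable regions appear as deeper minima.
    """
    values = [H_BONDS[b] for b in window]
    profile = []
    for i in range(len(values)):
        lo, hi = max(0, i - span), min(len(values), i + span + 1)
        profile.append(-sum(values[lo:hi]) / (hi - lo))
    return profile


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile has no defined correlation; report 0
    return sxy / sqrt(sxx * syy)


def scan(sequence, n):
    """Return the profile of every length-n window, moving one base at a time."""
    return [toy_cfd(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def correlation_matrix(profiles):
    """Assemble the matrix of r_ij values for all pairs of profiles."""
    return [[pearson(p, q) for q in profiles] for p in profiles]


gene = "ATGCGTACGTTAGCCGATAGGCTA"  # arbitrary example sequence
profiles = scan(gene, n=12)
r = correlation_matrix(profiles)
```

With a real CFD substituted for `toy_cfd`, entries of `r` could be compared against a chosen threshold, such as the rij>0.7 example given in claim 13, to flag similar windows.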
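The Eulerian representation of claims 21 and 22 can be pictured with a small sketch. The claims state only that the representation can be an n×n matrix with n equal to the number of bases used; the sketch below assumes, as one plausible reading, that entry (i, j) counts how often base i is immediately followed by base j. Under that assumption a sequence traces an Eulerian trail through the multigraph the matrix describes. The function names are hypothetical.

```python
BASES = "ACGT"


def transition_matrix(sequence):
    """4x4 matrix whose entry [i][j] counts how often base i is followed by base j."""
    index = {b: k for k, b in enumerate(BASES)}
    matrix = [[0] * len(BASES) for _ in BASES]
    for first, second in zip(sequence, sequence[1:]):
        matrix[index[first]][index[second]] += 1
    return matrix


def degree_balance(matrix):
    """Out-degree minus in-degree for each base (vertex) of the multigraph."""
    n = len(matrix)
    return [sum(matrix[i]) - sum(matrix[j][i] for j in range(n)) for i in range(n)]


# This example starts and ends on A, so every vertex is balanced.
m = transition_matrix("ACGTTGCA")
balance = degree_balance(m)
```

A sequence that starts and ends on different bases leaves a balance of +1 at its first base and -1 at its last, which is the standard condition for an open Eulerian trail.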
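The selection step of claim 27, in which oligomers are sorted into subpopulations by Tm range, can be sketched as follows. The claim does not say how Tm is obtained, so the sketch assumes the Wallace rule for short oligomers (2 °C per A or T, 4 °C per G or C) purely as a placeholder. The overlapping-window derivation of oligomers, the bin width, and all function names are also assumptions.

```python
from collections import defaultdict


def wallace_tm(oligo):
    """Wallace-rule estimate in deg C: 2 per A/T plus 4 per G/C (short oligos)."""
    at = sum(oligo.count(b) for b in "AT")
    gc = sum(oligo.count(b) for b in "GC")
    return 2 * at + 4 * gc


def oligomers(sequence, length):
    """All overlapping oligomers of the given length derived from a sequence."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def bin_by_tm(sequences, length, bin_width):
    """Group oligomers from every sequence into Tm ranges of width bin_width."""
    bins = defaultdict(list)
    for seq in sequences:
        for oligo in oligomers(seq, length):
            tm = wallace_tm(oligo)
            low = (tm // bin_width) * bin_width
            bins[(low, low + bin_width)].append(oligo)
    return dict(bins)


population = ["ATGCGTACGTTAGC", "GGCCATATTAGCGC"]  # arbitrary example sequences
subpopulations = bin_by_tm(population, length=8, bin_width=4)
```

Each key of `subpopulations` is a half-open Tm range and each value is the list of oligomers falling in it; a measured or nearest-neighbor Tm could replace `wallace_tm` without changing the grouping logic.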
RELATED APPLICATIONS
[0001] This application claims priority to U.S. provisional application No. 60/274,598 filed on Mar. 10, 2001, the contents of which are incorporated herein by reference.
Provisional Applications (1)

| Number | Date | Country |
|---|---|---|
| 60274598 | Mar 2001 | US |