The present invention relates to a set S of oligonucleotide sequences wherein the set S is designed to prevent mishybridization between any sequence in the set S of oligonucleotide sequences of a certain fixed length and other sequences in the set S including the overlap parts of their concatenated sequences by guaranteeing a certain amount of mismatches, a systematic method for designing the above mentioned set of sequences, a systematic method for designing a GC template or an AG template used for designing the above set S of oligonucleotide sequences, DNA or RNA chips, DNA or RNA tags, DNA or RNA computing systems, and DNA or RNA probes utilizing the set S of oligonucleotide sequences.
DNAs have a structure wherein four types of base, that is, adenine (A), cytosine (C), guanine (G) and thymine (T), are ligated together like a strand. Since A and T, and C and G form a base pair by hydrogen bond respectively, A-T and C-G are considered to be complementary. Two DNA strands have a complementary double helix structure, and the DNA double helix is separated into single-stranded DNAs when temperature rises, and the single-stranded DNA binds to a complementary strand again when temperature drops. This process of binding to a complementary strand is called hybridization, and it is known that the temperature at which DNA strands separate or hybridize depends on GC content in the sequence.
A problem of interaction between primers has been pointed out in designing the two types of primers that are indispensable for conducting PCR (polymerase chain reaction), a very useful gene amplification method and an essential technique in a wide range of biology-related studies. Because the concentration of primers in a PCR reaction liquid is far higher than that of the target gene, if the primers have a structure prone to hybridize with each other, mishybridization will occur between sense strands, between antisense strands, or between a sense strand and an antisense strand, and so-called primer dimers are formed, with the result that hybridization with the target gene is drastically suppressed.
Further, in so-called DNA computing that comprises the steps of; synthesizing DNA sequences in which various symbol manipulation operating systems such as logical expression and graph structure are recorded, and cutting and pasting the sequences according to protocols of biological experiments, DNAs as basic parts are synthesized in the number that corresponds to the size of problems, and the problems are solved by a very simple “generate and test” method (Science 266, 1021-1024, 1994). In other words, DNA computing can be carried out by generating DNA sequences randomly in an amount sufficient to cover solution space by connecting parts randomly, and by extracting only solutions that meet a certain requirement from the numerous combinations of the randomly generated sequences. For example, digestion by restriction enzyme can be used for the extraction of the solution mentioned above, and parts are designed so that sequences of incorrect solution contain a recognition site of a restriction enzyme and that sequences of correct solution do not contain a recognition site of a restriction enzyme. DNA memory wherein 5′-end of DNA is fixed on a solid phase is known as an application of such DNA computing model (Nature 403, 175-179, 2000), and a method for searching solutions by generating various combinations of sequences randomly and fixing them on a solid phase and serially cutting out inappropriate sequences from them, is used. In that method, restriction enzymes are used for cutting out sequences on the solid phase, and polymerase is used for extension. In case of this DNA memory, attention should be directed to prevent mishybridization between DNA sequences.
It is also known to design DNAs wherein mishybridization does not occur between DNA sequences in the above-mentioned primer designing, DNA computing, etc. For instance, a programmed computer system comprising means for designing oligonucleotide sequences based on the GenBank database of DNA and mRNA sequences and performing correct and incorrect match modeling with user-selected gene sequences, and means for performing hybridization strength modeling on gene sequences, etc (Published Japanese Translation of PCT International Publication for Patent Application No. 8-503091); a method for DNA computing by computer using genetic algorithm wherein shift-errors are prevented or minimized in consideration of the Hamming distance in frame shift-error hybridization process in which DNA sequences of fixed length are shifted each other (“A New Metric for DNA Computing” Proceedings of the 2nd Annual Genetic Programming Conference, Palo Alto, 472-478, 1997); a method for DNA computing by computer wherein the method is imposed a condition that subsequences of specific length in DNA sequences of fixed length do not appear more than once in designed DNA sequence sets of fixed length (European Patent Application No. 97302313, U.S. Pat. No. 5,604,097), are also reported.
DNA computing is a field of study in which computations in combinatorial mathematics, logic, etc. are carried out by biological experiments, as mentioned above. Specifically, it is computing that comprises the steps of artificially synthesizing DNA sequences in which various symbol manipulation operating systems such as logical expressions and graph structures are recorded, and cutting and pasting the sequences according to protocols of molecular biological experiments; the sequences obtained at the end of the experiments are the “calculation results” of DNA computing. Thus, demand for techniques that encode information with artificially created meaning (for example, logical parameters, mathematics, etc.) onto DNA bases is thought to be increasing at an accelerating pace with the progress of biotechnology. In order to make such techniques work well, it is indispensable to design DNA sequences skillfully in advance to avoid misinterpretation caused by errors. For instance, if symbol x is expressed as the four bases ACAC, the string xx would be ACACACAC, and the base sequence of x appears again across the joint, which causes errors. In order to prevent this, there is a need for a method for systematically and efficiently searching for a set of sequences in which a certain number of mismatches is guaranteed for every sequence, including at the sites where sequences are ligated to one another.
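The ACAC example above can be checked with a minimal sketch. The helper name `occurrences` is hypothetical and is introduced only for illustration; it lists every position at which the encoding of x reappears in the ligated string xx.

```python
def occurrences(symbol: str, text: str) -> list[int]:
    """Return every start index at which `symbol` occurs in `text` (overlaps allowed)."""
    width = len(symbol)
    return [i for i in range(len(text) - width + 1) if text[i:i + width] == symbol]


x = "ACAC"   # encoding of symbol x, taken from the text
xx = x + x   # the ligated string xx, i.e. ACACACAC
hits = occurrences(x, xx)
print(hits)  # any hit other than 0 and len(x) straddles the joint
```

A hit at an index other than 0 or 4 means that a copy of x can be read across the ligation site, which is the kind of error the designed set S is intended to exclude.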
As aforementioned, though methods for designing sequences wherein oligonucleotide sequences are constructed such that oligonucleotide sequences such as DNA sequences induce mismatches and can avoid mishybridization with each other are known, these methods are aimed at design of oligonucleotide such as DNA sequences to be fixed on a solid phase, and therefore, sequences that contain shift and ligation in oligonucleotide sequences to avoid mishybridization are not designed. For instance, a method for designing sequences that ensures that mishybridization is avoided even if DNA sequences are in a liquid phase or sequences are ligated each other, has not been reported so far. Further, conventional sequence design that avoids mishybridization is DNA computing by a computer using genetic algorithm, or a very simple “generate and test” method or a modified method thereof, and these DNA computing methods are not regarded as systematic computing methods.
The object of the present invention is to provide a method for systematically designing a set S of oligonucleotide sequences of predetermined length n (n is an integer, 3 or more, preferably, 6 or more), wherein each of oligonucleotide sequences in the set S induces equal to or more than a fixed, predetermined number of mismatches against any of oligonucleotide sequences in the set S, a complementary sequence of each of oligonucleotide sequences in the set S, sequences constructed by shifting these sequences, and sequences produced by ligation of these oligonucleotide sequences, of their complementary sequences, and of the oligonucleotide sequences and their complementary sequences. The set S of oligonucleotide sequences can avoid mishybridization between them, their complementary sequences, sequences constructed by shifting these sequences, and sequences produced by ligation of the oligonucleotide sequences, of their complementary sequences, and of the oligonucleotide sequences and their complementary sequences. In addition, the object of the present invention is to provide a method for systematically designing a set S of oligonucleotide sequences wherein mishybridization can be avoided for reverse sequences as well as for complementary sequences. Meanwhile, “to be able to avoid mishybridization between oligonucleotide sequences by inducing mismatches of predetermined value or more” is hereinafter sometimes referred to as “to be orthogonal”, and “a sequence that is orthogonal” is hereinafter sometimes referred to as “an orthogonal sequence”.
The present inventor has conducted intensive study for a systematic sequence design method for orthogonal sequences including shift and ligation, which is an important technique for obtaining correct experimental results in DNA computing and biotechnology in future, and has found that a set S of orthogonal oligonucleotide sequences that ensures a mishybridization value including shift and ligation by: 1) selecting a binary string (GC template) such that its Hamming distance to its reverse sequence, to its block shift, and its Hamming distance to the overlap part of its tandem concatenation, its concatenation with its reverse sequence, and the tandem concatenation of its reverse sequence, is equal to or above the predetermined value k, and in the following, an oligonucleotide sequence of length n is specified by the binary string of 0 and 1 (GC template) of predetermined length L (L is an integer 6 or more), meaning that the positions of G or C ([GC]), or A or T ([AT]) are fixed; 2) combining the codewords of any error correcting code with the selected GC template to specify a set of oligonucleotide sequences that induce at least k mismatches between any of them. The present invention has been thus completed.
The present invention relates to a set S of oligonucleotide sequences of predetermined length n (n is an integer 3 or more), wherein each of oligonucleotide sequences in the set S induces equal to or more than a fixed, predetermined number of mismatches against any of oligonucleotide sequences in the set S, a complementary sequence of each of oligonucleotide sequences in the set S, sequences constructed by shifting these sequences, and sequences produced by ligation of these oligonucleotide sequences, of their complementary sequences, and of the oligonucleotide sequences and their complementary sequences, and wherein the set S of oligonucleotide sequences can avoid mishybridization between them, their complementary sequences, sequences constructed by shifting these sequences, and sequences produced by ligation of the oligonucleotide sequences, of their complementary sequences, and of the oligonucleotide sequences and their complementary sequences (“1”), a set S of oligonucleotide sequences of predetermined length n (n is an integer 3 or more), wherein each of oligonucleotide sequences in the set S induces equal to or more than a fixed, predetermined number of mismatches against any of oligonucleotide sequences in the set S, a reverse sequence of each of oligonucleotide sequences in the set S, sequences constructed by shifting these sequences, and sequences produced by ligation of the oligonucleotide sequences, of their reverse sequences, and of the oligonucleotide sequences and their reverse sequences, and wherein the set S of oligonucleotide sequences can avoid mishybridization between them, their reverse sequences, sequences constructed by shifting these sequences, and sequences produced by ligation of the oligonucleotide sequences, of their reverse sequences, and of the oligonucleotide sequences and their reverse sequences (“2”), the set S of oligonucleotide sequences according to “1” or “2”, which comprises oligonucleotide sequences of predetermined length n (n is an integer 6 or more) (“3”), the set S of oligonucleotide sequences according to any one of “1” to “3”, wherein the set S of oligonucleotide sequences of predetermined length n is a set S of oligonucleotide sequences of length 32 or less (“4”), the set S of oligonucleotide sequences according to any one of “1” to “4”, wherein the predetermined number of mismatches is equal to or more than one-fourth of the sequence length n (“5”), the set S of oligonucleotide sequences according to any one of “1” to “5”, wherein the set S of oligonucleotide sequences is a set of oligonucleotide sequences that contains or never contains a particular subsequence (“6”), and the set S of oligonucleotide sequences according to “6”, wherein the particular subsequence is a restriction site (“7”).
The present invention also relates to a method for designing the set S of oligonucleotide sequences according to “3”, comprising the following steps: 1) Select a binary string (GC template) such that its Hamming distance to its reverse sequence, to its block shift, and its Hamming distance to the overlap part of its tandem concatenation, its concatenation with its reverse sequence, and the tandem concatenation of its reverse sequence, is equal to or above the predetermined value k, and in the following, an oligonucleotide sequence of length n is specified by the binary string of 0 and 1 (GC template) of predetermined length L (L is an integer 6 or more), meaning that the positions of G or C ([GC]), or A or T ([AT]) are fixed; 2) Combine the codewords of any error correcting code with the selected GC template to specify a set of oligonucleotide sequences that induce at least k mismatches between any of them (“8”), a method for designing the set S of oligonucleotide sequences according to “1” or “2”, comprising the following steps: 1) Select a binary string (AG template) such that its Hamming distance to its reverse inverted sequence, to its block shift, and its Hamming distance to the overlap part of its tandem concatenation, its concatenation with its reverse inverted sequence, and the tandem concatenation of its reverse inverted sequence, is equal to or above the predetermined value k, and in the following, an oligonucleotide sequence of length n is specified by the binary string of 0 and 1 (AG template) of predetermined length L (L is an integer 6 or more), meaning that the positions of A or G ([AG]), or C or T ([CT]) are fixed; 2) Combine the codewords of any error correcting constant-weight code with the selected AG template to specify a set of oligonucleotide sequences that induce at least k mismatches between any of them (“9”), the method for designing a set S of oligonucleotide sequences according to “8” or “9”, wherein any of these oligonucleotide sequences, of which Hamming distance is equal to or above k, induces at least k mismatches against any of the sequences in the set S, their complementary sequences, sequences constructed by shifting these sequences, and the sequences produced by ligation of sequences in the set S, of their complementary sequences, and of the sequences and their complementary sequences, and wherein the sequences in the set S can avoid mishybridization between them, their complementary sequences, or sequences constructed by shifting these sequences, and sequences produced by ligation of sequences in the set S, of their complementary sequences, and of the sequences and their complementary sequences (“10”), the method for designing a set S of oligonucleotide sequences according to “8” or “9”, wherein any of these oligonucleotide sequences, of which Hamming distance is equal to or above k, induces at least k mismatches against any of the sequences in the set S, their reverse sequences, sequences constructed by shifting these sequences, and the sequences produced by ligation of sequences in the set S, of their reverse sequences, and of the sequences and their reverse sequences, and wherein the sequences in the set S can avoid mishybridization between them, their reverse sequences, or sequences constructed by shifting these sequences, and sequences produced by ligation of sequences in the set S, of their reverse sequences, and of the sequences and their reverse sequences (“11”), the method for designing a set S of oligonucleotide sequences according to any one of “7” to “9”, wherein the set S of oligonucleotide sequences of predetermined length n is a set S of oligonucleotide sequences of length 32 or less (“12”), the method for designing a set S of oligonucleotide sequences according to any one of “8” to “12”, wherein the predetermined value k is one-fourth of L or more (“13”), the method for designing a set S of oligonucleotide sequences according to any one of “8” to “13”, wherein the set S of oligonucleotide sequences is a set of oligonucleotide sequences that contains or never contains a particular subsequence (“14”), the method for designing a set S of oligonucleotide sequences according to “14”, wherein the particular subsequence is a restriction site (“15”), and the method for designing a set S of oligonucleotide sequences according to any one of “8” to “15”, wherein the codewords of an error correcting code are selected from Hamming codes, BCH codes, maximum-length codes, Golay codes, Reed-Muller codes, Reed-Solomon codes, Hadamard codes, Preparata codes, reversible codes, or constant-weight codes (“16”).
The present invention further relates to a method for designing a GC template used for constructing the set S of oligonucleotide sequences according to “3”, by selecting a GC template so that its Hamming distance to its reverse sequence, to its block shift, and its Hamming distance to the overlap part of its tandem concatenation, its concatenation with its reverse sequence, and the tandem concatenation of its reverse sequence, is equal to or above the predetermined value k, and wherein an oligonucleotide sequence of length n is specified by the binary string of 0 and 1 (GC template) of predetermined length L (L is an integer 6 or more), meaning that the positions of G or C ([GC]), or A or T ([AT]) are fixed (“17”), the method for designing a GC template according to “17”, wherein the GC template of predetermined length L is a GC template of length 32 or less (“18”), the method for designing a GC template according to “17” or “18”, wherein the predetermined value k is one-fourth of L or more (“19”), the method for designing a GC template according to “18”, wherein the GC template shows 2, 4, 6, 7, 8, 9, 10, 11, or 12, as the predetermined value k, when the length L of the GC template is 6 to 10, 11 to 15, 16 to 18, 19, 20 to 22 and 24, 23 and 25, 26 and 27, 28 and 29, or 30 to 32, respectively (“20”), the method for designing a GC template according to any one of “17” to “20”, wherein the set S of oligonucleotide sequences is a set of oligonucleotide sequences that contains or never contains a particular subsequence (“21”), and the method for designing a GC template according to “21”, wherein the particular subsequence is a restriction site (“22”).
The present invention also relates to a method for designing an AG template used for constructing the set S of oligonucleotide sequences according to “1” or “2”, by selecting an AG template so that its Hamming distance to its reverse inverted sequence, to its block shift, and its Hamming distance to the overlap part of its tandem concatenation, its concatenation with its reverse inverted sequence, and the tandem concatenation of its reverse inverted sequence, is equal to or above the predetermined value k, and wherein an oligonucleotide sequence of length n is specified by the binary string of 0 and 1 (AG template) of predetermined length L (L is an integer 6 or more), meaning that the positions of A or G ([AG]), or C or T ([CT]) are fixed (“23”), the method for designing an AG template according to “23”, wherein the AG template of predetermined length L is an AG template of length 32 or less (“24”), the method for designing an AG template according to “23” or “24”, wherein the predetermined value k is one-fourth of L or more (“25”), the method for designing an AG template according to “23”, wherein the AG template shows 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, or 13, as the predetermined value k, when the length L of the AG template is 3 to 5, 6 to 8, 9, 10 to 12, 13 and 14, 15 to 18, 19, 20 to 22, 23, 24 to 26, 27, 28 to 30, 31, or 32, respectively (“26”), the method for designing an AG template according to any one of “23” to “26”, wherein the set S of oligonucleotide sequences is a set of oligonucleotide sequences that contains or never contains a particular subsequence (“27”), and the method for designing an AG template according to “27”, wherein the particular subsequence is a restriction site (“28”).
The present invention still further relates to DNA or RNA chips which contain the set S of oligonucleotide sequences according to any one of “1” to “7” (“29”), DNA or RNA tags which contain the set S of oligonucleotide sequences according to any one of “1” to “7” (“30”), DNA or RNA computing systems which use the set S of oligonucleotide sequences according to any one of “1” to “7” (“31”), and DNA or RNA probes selected from the set S of oligonucleotide sequences according to any one of “1” to “7” (“32”).
The set S of oligonucleotide sequences (hereinafter sometimes referred to as “P sequence”) of the present invention is not particularly limited as long as it is a set of orthogonal sequences that comprises a set S of P sequences of predetermined length n (in case of GC templates, n is an integer 6 or more, in case of AG templates, n is an integer 3 or more), wherein each of P sequences in the set S induces equal to or more than a fixed, predetermined number of mismatches against any of P sequences in the set S, a complementary sequence (hereinafter sometimes referred to as “PC sequence”) or reverse sequence (hereinafter sometimes referred to as “PR sequence”) of each of P sequences in the set S, sequences constructed by shifting these sequences, and sequences produced by ligation of these P sequences, of PC sequences or PR sequences, and of the P sequences and PC sequences or PR sequences in the set S. The set S of P sequences can avoid mishybridization between them, PC sequences or PR sequences, sequences constructed by shifting these sequences, and sequences produced by ligation of the P sequences, of their complementary sequences, and of the P sequences and PC sequences or PR sequences in the set S. The above-mentioned oligonucleotide sequences include DNA sequences and RNA sequences. In addition, though the upper limit of the predetermined length n of the oligonucleotide sequences (in case of GC templates, n is an integer 6 or more, in case of AG templates, n is an integer 3 or more) is not particularly defined, it is normally 100 bases, preferably 32 bases in consideration of the use as a primer in PCR or on a DNA chip. On the other hand, when the predetermined length is 5 or less (in case of GC templates), or 2 or less (in case of AG templates), the set S of oligonucleotide sequences of the present invention cannot be obtained. The set S of oligonucleotide sequences, which is a target of the present invention, conveniently includes subsets of the set S.
Hereinafter, it is described how the set S inducing mismatches is designed with the use of a GC template mainly, focusing the case where the oligonucleotide sequence is a DNA sequence, including the case of complementary sequences.
The P sequences in the set S of the present invention designed by using a GC template not only induce mismatches of predetermined value or more between the sequences themselves, and between the P sequences and other P sequences in the set S, in both cases where sequences are shifted (sequences are staggered) and not shifted, and can avoid mishybridization, but also induce mismatches of predetermined value or more between the P sequences and PC sequences which are complementary sequences of each of other oligonucleotide sequences (excluding the P sequences themselves) in the set S, that is, PC sequences constructed by substituting A, T, G and C in the P sequences with T, A, C and G respectively, and reversing the direction of 5′ and 3′, in both cases where sequences are shifted and not shifted, and can avoid mishybridization, and induce mismatches of predetermined value or more between the P sequences and oligonucleotide sequences constructed by ligating each of oligonucleotide sequences in the set S, that is, ligated sequences of P sequences, and ligated sequences of PC sequences, ligated sequences of P sequences and PC sequences, ligated sequences of PC sequences and P sequences, etc., and can avoid mishybridization. Here, mismatch means a pairing with bases other than complementary bases in hybridization, and as mismatches of predetermined value or more, there is no particular limitation as long as it is the number of mismatches with which mishybridization can be avoided, however, it is preferable that mismatches are one-fifth or more, more preferably one-fourth or more, and most preferably one-third or more of predetermined length n (n is an integer 6 or more) of oligonucleotide sequences.
The P sequences in the set S of the present invention not only induce mismatches of predetermined value or more between the sequences themselves, and between the P sequences and other P sequences in the set S, in both cases where sequences are shifted (sequences are staggered) and not shifted, and can avoid mishybridization, but also induce mismatches of predetermined value or more between the P sequences and PR sequences which are reverse sequences of each of P sequences in the set S, that is, sequences (for example, TCAGTTAA) whose 5′ side and 3′ side correspond to the 3′ side and 5′ side, respectively, of the 5′→3′ P sequences (for example, AATTGACT), in both cases where sequences are shifted and not shifted, and can avoid mishybridization, and induce mismatches of predetermined value or more between ligated sequences of P sequences, and ligated sequences of PR sequences, ligated sequences of P sequences and PR sequences, ligated sequences of PR sequences and P sequences, etc., and can avoid mishybridization.
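The two derived sequences used in this description, the PR sequence (reverse) and the PC sequence (complementary, with the 5′ and 3′ direction reversed), can be sketched for DNA as follows. The function names are hypothetical and chosen only for this illustration.

```python
# Base substitution A<->T, G<->C used for the complementary (PC) sequence.
_COMPLEMENT = str.maketrans("ATGC", "TACG")


def reverse_sequence(p: str) -> str:
    """PR sequence: the bases of P read in the opposite direction."""
    return p[::-1]


def complementary_sequence(p: str) -> str:
    """PC sequence: substitute A, T, G, C with T, A, C, G and reverse 5' and 3'."""
    return p.translate(_COMPLEMENT)[::-1]


p = "AATTGACT"                     # example P sequence from the text
print(reverse_sequence(p))         # TCAGTTAA, as stated in the text
print(complementary_sequence(p))   # AGTCAATT
```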
Further, it is preferable that the oligonucleotide sequences that compose the set S of the present invention can be operated as oligonucleotide sequences that contain or never contain particular subsequences. Examples of the particular subsequences include restriction sites; expression signal sequences including poly A portions of RNA, ATG which is a translation initiation codon, TAA, TAG, TGA, etc. which are stop codons; consensus sequences GCCAATCT, ATGCAAAT, recognized by transcription factors, and optional DNA sequence signal such as base sequences encoding variable regions of antibodies.
The set S of oligonucleotide sequences of the present invention mentioned above can usually be designed in two steps. The first step is a step of designing a GC template with the use of the Hamming distance, and in the next step, the set S of oligonucleotide sequences of the present invention as an object can be designed from the set of oligonucleotide sequences represented by the designed GC template by using the theory of error correcting codes. Since every position of a DNA sequence is either G or C ([GC]), or A or T ([AT]), it is determined in the first step whether each of the positions of the sequences is [GC] or [AT]. These positions are represented by a GC template comprising 0 and 1, namely b1, b2, . . . , bL (bi ∈ {0, 1}), in which 1 and 0 mean [AT] and [GC], respectively, or alternatively 1 and 0 mean [GC] and [AT], respectively. Therefore, not 4^L kinds but 2^L kinds of sequences are represented by a GC template of length L. In the next step, base sequences are determined by specifically substituting the positions of 1 in the GC template with bases of [AT], and the positions of 0 with bases of [GC], or the positions of 1 with bases of [GC], and the positions of 0 with bases of [AT].
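The second step can be illustrated by a minimal sketch in which a binary codeword selects the concrete base at each template position. The bit-to-base assignment used here (codeword bit 0 gives G or A, bit 1 gives C or T) and the names `apply_template` and `all_sequences` are assumptions made for illustration; the text fixes only which positions are [GC] and which are [AT].

```python
from itertools import product


def apply_template(template: str, codeword: str, one_is_gc: bool = True) -> str:
    """Combine a GC template with a binary codeword of the same length.

    Assumed convention: within [GC] codeword bit 0 -> G, 1 -> C;
    within [AT] codeword bit 0 -> A, 1 -> T.
    """
    if len(template) != len(codeword):
        raise ValueError("template and codeword must have the same length")
    bases = []
    for t_bit, c_bit in zip(template, codeword):
        is_gc = (t_bit == "1") == one_is_gc   # which class this position belongs to
        pair = "GC" if is_gc else "AT"
        bases.append(pair[int(c_bit)])        # codeword bit picks the base in the class
    return "".join(bases)


def all_sequences(template: str, one_is_gc: bool = True) -> list[str]:
    """Enumerate the 2^L sequences represented by a template of length L."""
    L = len(template)
    return [apply_template(template, "".join(bits), one_is_gc)
            for bits in product("01", repeat=L)]


print(apply_template("1100", "0101"))   # GCAT under the assumed convention
print(len(all_sequences("110100")))     # 64, i.e. 2^6
```

Under this convention, two sequences built on the same template differ in exactly the positions where their codewords differ, so the minimum distance of the chosen error correcting code carries over to the number of mismatches between the unshifted sequences.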
The Hamming distance mentioned above is used as a measure of similarity between sequences. For example, the Hamming distance between two strings x = x1, x2, . . . , xn and y = y1, y2, . . . , yn is defined as the number of indices i for which xi ≠ yi. In addition, as mishybridization between DNA sequences can occur even when sequences are shifted (staggered), it is necessary to consider the Hamming distance in the case where sequences are shifted. Since a “shift” is possible when one sequence is longer than the other, in case of |x|<|y|, the Hamming distance between the two strings is defined as the minimum value of the Hamming distance between x and each of the (|y|−|x|+1) subsequences of length |x| contained in y. The Hamming distance indicated by this minimum value is represented by H(x, y).
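The definition of H(x, y) can be written down directly as a minimal sketch. The function names are hypothetical; for strings of equal length the sketch reduces to the ordinary Hamming distance, and swapping the arguments when x is the longer string is a convenience assumed here.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions i with x[i] != y[i]; x and y must have equal length."""
    if len(x) != len(y):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(x, y))


def H(x: str, y: str) -> int:
    """Minimum Hamming distance between the shorter string and every
    window of the same length contained in the longer string."""
    if len(x) > len(y):
        x, y = y, x                      # make x the shorter string
    n = len(x)
    # There are |y| - |x| + 1 windows of length |x| in y.
    return min(hamming(x, y[i:i + n]) for i in range(len(y) - n + 1))


print(H("ACGT", "ACGA"))    # 1: equal lengths, one differing position
print(H("AAA", "CCCAC"))    # 2: best window is CCA or CAC
```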
Next, a function MD (an abbreviation of min distance) of a GC template t is considered in order to obtain the Hamming distance between the GC template t and ligated sequences of the GC template t, ligated sequences of the reverse sequence tR of the GC template t, and ligated sequences of the GC template t and the reverse sequence tR. The reverse sequence tR means the sequence in which the binary string of the GC template t is arranged in reverse order. Since the Hamming distances between the GC template t and the sequences at the two outer sides of a ligated sequence, namely t itself and its reverse sequence tR, are already obtained, it suffices to consider the sequences in which one letter each is deleted from both ends of the ligated sequences when obtaining the minimum value of the Hamming distance by shifting the GC template t against the ligated sequences. Consequently, it is convenient to use a symbol [ ] in the mathematical formula of MD(t). The meaning of the symbol [ ] is: [s1 s2 s3 . . . sm-1 sm] = s2 s3 . . . sm-1, that is, it means the sequence in which one letter each is deleted from both ends. Therefore, the minimum MD(t) of the Hamming distances between the GC template t and the ligated sequences is represented by the following formula.
MD(t) = min{H(t, tR), H(t, [tt]), H(t, [ttR]), H(t, [tRt]), H(t, [tRtR])}.
Consequently, in case where MD(t) = k (k ≥ 0) for a GC template t, a Hamming distance of at least k is ensured when the GC template t is shifted against the sequences [tt], [ttR], [tRt], and [tRtR], that is, the ligated sequences from which one letter each is deleted from both ends, including the ligating parts thereof.
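The formula for MD(t) can be evaluated mechanically. The following is a minimal sketch under the definitions given above, with hypothetical function names; templates are written as strings of the characters 0 and 1, and `inner` plays the role of the symbol [ ].

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))


def H(x: str, y: str) -> int:
    """Minimum Hamming distance of x against every length-|x| window of y (|x| <= |y|)."""
    n = len(x)
    return min(hamming(x, y[i:i + n]) for i in range(len(y) - n + 1))


def inner(s: str) -> str:
    """The symbol [ ]: delete one letter from each end."""
    return s[1:-1]


def md(t: str) -> int:
    """MD(t) = min{H(t,tR), H(t,[tt]), H(t,[ttR]), H(t,[tRt]), H(t,[tRtR])}."""
    tr = t[::-1]                                   # reverse sequence tR
    ligated = (t + t, t + tr, tr + t, tr + tr)
    return min([H(t, tr)] + [H(t, inner(s)) for s in ligated])


print(md("110100"))   # a template of length 6
print(md("0110"))     # a palindromic template equals its reverse, so MD is 0
```

A template that reads the same in both directions always gives MD(t) = 0, because H(t, tR) is then zero; such templates are therefore unsuitable.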
Thus, the method for designing a GC template of the present invention is used at the first step of constructing the set S of oligonucleotide sequences of the present invention. As seen from the above explanation, the method for designing a GC template of the present invention is not particularly limited as long as it is a method comprising selection of GC templates such that its Hamming distance to its reverse sequence, to its block shift, and its Hamming distance to the overlap part of its tandem concatenation, its concatenation with its reverse sequence, and the tandem concatenation of its reverse sequence, is equal to or above the predetermined value k. In the following, an oligonucleotide sequence of length n is specified by the binary string of 0 and 1 (GC template) of predetermined length L (L is an integer 6 or more), meaning that the positions of G or C ([GC]), or A or T ([AT]) are fixed. However, the length L of GC template is 6 or more, preferably 6 to 100, more preferably 6 to 32, most preferably around 20, which is often used in experiments of molecular biology. If the length is 5 or less, the one having desired Hamming distance cannot be obtained. By using the GC template having the length L, a set S of oligonucleotide sequences of corresponding length n can be obtained. Further, the predetermined value k is not particularly limited as long as it is a value that allows oligonucleotide sequences constructed from the GC template to be the oligonucleotide sequences of the present invention that can avoid mishybridization. The value is preferably one-fifth or more, more preferably one-fourth or more, most preferably one-third or more of the length L of the GC template.
In general, when the length L is increased or the MD value (k value) is decreased, many more GC templates exist; however, a GC template of predetermined length having the greatest k value (MD value) is particularly important. Examples of GC templates of length L=6 to 32 having the greatest k value (MD value) include GC templates of length L=6 to 10, 11 to 15, 16 to 18, 19, 20 to 22 and 24, 23 and 25, 26 and 27, 28 and 29, and 30 to 32, for which the predetermined value k=2, 4, 6, 7, 8, 9, 10, 11 and 12, respectively. The maximum value of the predetermined value k for GC templates of length L=6 to 32, the number of GC templates having the maximum value, and specific examples are shown in [Table 1]. In addition, the shortest GC templates that fulfill a specific MD value (k value) are shown in [Table 2]. Further, specific examples of GC templates of length L=11 to 27 and of GC templates of length L=28 to 30 are shown in [Table 3] and [Table 4], respectively. In [Table 2], GC templates are enumerated excluding those that coincide under reversal of the sequence or under interchange of 0 and 1, and in [Table 3] and [Table 4], “items” are the numbers obtained after omitting GC templates that become identical by cyclic shift.
The GC template sequences enumerated in [Table 1] to [Table 4], etc., can be selected by a person skilled in the art by exhaustively searching all patterns, from the sequence comprising only 0 to the sequence comprising only 1. However, there is no need to search all 2^L patterns to find a GC template of length L. It suffices to consider the GC templates in which the number of 1 bits is L/2 or less, because GC templates whose bits 0 and 1 are interchanged have the same property. In addition, from the constraint on the number of mismatches, it can be shown that when the minimum distance is d, the number of 1 bits is at least (L − sqrt(L^2 − 2dL))/2 (sqrt means square root). The GC templates can be obtained efficiently by additionally using these constraints. Further, when GC templates are designed such that the set S of oligonucleotide sequences constructed from them contains or never contains particular subsequences such as the restriction sites mentioned above, the design corresponds to narrowing the space for the exhaustive search, and therefore makes the design easier.
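The pruned exhaustive search described above might look like the following sketch. The names `candidate_templates` and `search` are hypothetical, the MD function follows the same assumptions as the selection criterion (end-aligned windows are skipped), and the lower bound on the number of 1 bits is simply taken from the text; the sketch applies no lower bound when the square root is not real, which is an assumption of this illustration.

```python
from itertools import combinations
from math import ceil, sqrt


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def md(t: str) -> int:
    """MD value of a GC template (shifted windows exclude offsets 0 and L)."""
    L, r = len(t), t[::-1]
    shifted = [
        hamming(t, c[i:i + L])
        for c in (t + t, r + r, t + r, r + t)
        for i in range(1, L)
    ]
    return min([hamming(t, r)] + shifted)


def candidate_templates(L: int, d: int):
    """Yield templates whose number of 1 bits lies in the pruned range."""
    disc = L * L - 2 * d * L
    # Lower bound quoted in the text; no pruning if the root is not real.
    low = ceil((L - sqrt(disc)) / 2 - 1e-9) if disc >= 0 else 0
    for ones in range(max(low, 0), L // 2 + 1):  # at most L/2 one-bits
        for positions in combinations(range(L), ones):
            bits = ["0"] * L
            for p in positions:
                bits[p] = "1"
            yield "".join(bits)


def search(L: int, d: int) -> list[str]:
    """All pruned candidates of length L with MD value of at least d."""
    return [t for t in candidate_templates(L, d) if md(t) >= d]


found = search(7, 2)
print(len(found), "1101000" in found)
```

Because templates with more than L/2 one-bits are skipped, each template found also stands for the template obtained by interchanging 0 and 1.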
The set S of oligonucleotide sequences of the present invention can be designed by a step that follows the design of GC templates using the Hamming distance mentioned above and that uses the theory of error correcting codes. That is, codewords of any error correcting code are combined with the designed GC template to specify a set of oligonucleotide sequences, and the positions 1 and 0 of the GC template are substituted with bases of [AT] and [GC], respectively, or with bases of [GC] and [AT], respectively. As the codewords of the error correcting codes mentioned above, any codewords can be used as long as they are codewords of known error correcting codes, and specific examples include Hamming codes, BCH codes, maximum-length codes, Golay codes, Reed-Muller codes, Reed-Solomon codes, Hadamard codes, Preparata codes, and reversible codes.
The motive for using the theory of error correcting codes is to ensure mismatches to complementary sequences in the case where no shift occurs (see claim 1). Therefore, for the set S that induces mismatches in consideration of reverse sequences as well (see claim 2), it is not always necessary to use error correcting codes. An error correcting code is a set of codewords in which there are at least a certain number of mismatches between arbitrary codewords. To prevent mishybridization between a set S and the set of its reverse sequences, it is only necessary to apply a set of codewords in which there are at least a certain number of matches (not mismatches) between arbitrary codewords. In the set S of oligonucleotide sequences of the present invention, the information of both the codewords and the GC templates is reflected in the sequences. Therefore, it suffices to use error correcting codes maintaining a Hamming distance (number of mismatches) of k or more in order to ensure k mismatches to complementary sequences, and it suffices to use codes maintaining a number of matches of k or more in order to ensure k mismatches to reverse sequences.
In the theory of error correcting codes, codes have been developed in which redundant bits for detecting and correcting errors, called parity bits, are added to given information bits so that the Hamming distance between arbitrary codewords is above a certain value. The minimum value of the Hamming distance between codewords is called the minimum distance. As the object of coding theory is to design codes that keep the minimum distance large and contain many codewords, there are many codes that meet the purpose of the present invention. For example, the Golay code of code length 23 and minimum distance 7 has 4096 codewords. With the use of this code, it is possible to design 4096 oligonucleotides for one GC template of length 23 (whose MD value is up to 9).
Next, a specific example of the combination of error correcting codes and GC templates is explained. The Hamming code of minimum distance 3 and length 7 is applied to the GC template 1101000, for which MD(t)=2 and L=7, with the template supplying the upper bit and the codeword the lower bit at each position. It is ensured that the sequences thus constructed have at least two mismatches (three mismatches when no shift occurs) against any ligation or shift, on each side. For instance, if it is defined that 00 is A, 01 is T, 10 is G, and 11 is C, a set of 16 DNA sequences of 7 bases whose GC content is 3/7 is given, as shown in [Table 5]. Further, if it is defined that 00 is G, 01 is C, 10 is A, and 11 is T, a set of 16 DNA sequences of 7 bases whose GC content is 4/7 is given, as shown in [Table 6].
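The combination in this example can be sketched as follows. The generator matrix `G` is one standard systematic form of the [7,4] Hamming code and is an assumption of this sketch, so the resulting sequences are illustrative and need not coincide with those listed in [Table 5]. The names `codewords` and `to_dna` are hypothetical. The template bit is taken as the upper bit and the codeword bit as the lower bit, with 00=A, 01=T, 10=G and 11=C.

```python
from itertools import product

TEMPLATE = "1101000"  # GC template from the example (1 marks a G/C position)

# Rows of an assumed systematic generator matrix of a [7,4] Hamming code.
G = ["1000110", "0100101", "0010011", "0001111"]

# Upper bit from the template, lower bit from the codeword.
BASE = {"00": "A", "01": "T", "10": "G", "11": "C"}


def codewords() -> list[str]:
    """All 16 codewords spanned by the rows of G over GF(2)."""
    words = []
    for message in product((0, 1), repeat=len(G)):
        word = [0] * len(G[0])
        for bit, row in zip(message, G):
            if bit:
                word = [w ^ int(r) for w, r in zip(word, row)]
        words.append("".join(map(str, word)))
    return words


def to_dna(template: str, codeword: str) -> str:
    """Merge template and codeword bit by bit into a DNA sequence."""
    return "".join(BASE[t + c] for t, c in zip(template, codeword))


sequences = [to_dna(TEMPLATE, c) for c in codewords()]
for s in sequences:
    print(s)
```

Since every sequence shares the same template, two sequences differ exactly where their codewords differ, so the minimum distance 3 of the code carries over to the unshifted sequences, and each sequence has G or C at the same three positions.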
The method for designing the set S of oligonucleotide sequences of the present invention using GC templates is specifically shown above. As seen from the above explanation, the method for designing the set S of oligonucleotide sequences of the present invention is not particularly limited as long as it comprises the steps of: selecting a GC template such that its Hamming distance to its reverse sequence, its Hamming distance to its block shift, and its Hamming distance to the overlap parts of its tandem concatenation, of its concatenation with its reverse sequence, and of the tandem concatenation of its reverse sequence are each equal to or above the predetermined value k, wherein the GC template is a binary string of 0 and 1 of predetermined length L (L is an integer of 6 or more) that specifies an oligonucleotide sequence of length n by fixing the positions of G or C ([GC]) and of A or T ([AT]); and combining the codewords of any error correcting code of minimum distance k with the selected GC template to specify a set of oligonucleotide sequences that induce at least k mismatches between any of them.
However, a preferable design method is one in which the set of oligonucleotide sequences maintaining the Hamming distance k induces at least a fixed, predetermined number of mismatches against any of the P sequences in the set S, the complementary sequence or reverse sequence of each of the P sequences in the set S, sequences constructed by shifting these sequences, and sequences produced by ligation of the P sequences, of the PC sequences or PR sequences, and of the P sequences and the PC sequences or PR sequences in the set S, so that the P sequences of the set S can avoid mishybridization with one another, with the PC sequences or PR sequences, with sequences constructed by shifting these sequences, and with all of the ligated sequences mentioned above.
Further, in the method for designing a GC template of the present invention, the length n of the oligonucleotide sequences in the predetermined set S, the length L of the GC templates, and the predetermined value k are as explained above; the set S of oligonucleotide sequences may be a set of oligonucleotide sequences that contains or never contains particular subsequences, as explained above; and Hamming codes, BCH codes, maximum-length codes, Golay codes, Reed-Muller codes, etc. can be used as the codewords of any error correcting code, as mentioned above.
So far, GC templates whose binary strings designate [GC] or [AT] have been described. As an application thereof, a design method using an AG template, in which each position designates A or G ([AG]), or T or C ([TC]), is exemplified. For this purpose, the function MD defined for the GC template is redefined as follows.
MD(t) = min{H(t, T^R), H(t, [tt]), H(t, [T^RT^R]), H(t, [tT^R]), H(t, [T^Rt])}
Here, the symbol T means the binary string constructed by interchanging 0 and 1 in all bits of the template t (for example, when t=010101, then T=101010). The largest difference from the GC template is the following: when a binary string that maximizes this MD value is selected from among the binary strings of a given length L and is set to be t, the binary string t designates [AG] or [TC], and therefore the GC content of the designed DNA sequences is not standardized when t is combined with arbitrary error correcting codewords. In GC templates, the positions of [GC] are designated by the 0 and 1 of the template, and the positions of [AG] are designated by the 0 and 1 of the error correcting codewords. In AG templates, these designations are reversed. Therefore, since it is impossible to standardize the GC content with arbitrary error correcting codewords, it is necessary to use error correcting codes called constant-weight codes, whose codewords all contain the same number of 1s. Constant-weight codes are more difficult to design than generally used codes such as BCH codes or Hadamard codes, which can be used with templates designating [GC] or [AT], but they can be designed systematically with the use of the results described in reference BSS90 (IEEE Trans. On Information Theory, 36, pp. 1334-1380, 1990).
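The redefined MD function for AG templates can be sketched in the same manner. The names `flip_bits` and `md_ag` are hypothetical, and, as an assumption of this sketch, the shifted comparison again skips the two end-aligned windows of each concatenation.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def flip_bits(t: str) -> str:
    """Interchange 0 and 1 in every bit of t (the string called T)."""
    return t.translate(str.maketrans("01", "10"))


def min_shifted_distance(t: str, concat: str) -> int:
    """Minimum distance of t to the windows of concat at offsets 1..L-1."""
    L = len(t)
    return min(hamming(t, concat[i:i + L]) for i in range(1, L))


def md_ag(t: str) -> int:
    """MD value of an AG template t given as a string of '0' and '1'."""
    tr = flip_bits(t)[::-1]  # bit-flipped template, then reversed
    return min(
        hamming(t, tr),
        min_shifted_distance(t, t + t),
        min_shifted_distance(t, tr + tr),
        min_shifted_distance(t, t + tr),
        min_shifted_distance(t, tr + t),
    )


print(flip_bits("010101"), md_ag("010101"), md_ag("001"))
```

For t=010101 the bit-flipped and reversed string equals t itself, so the first term, and hence the MD value in this sketch, is 0. For the length-3 string 001 the sketch gives 1.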
However, while constraints are imposed on the available error correcting codes, it is possible to make the MD value of the templates, that is, the Hamming distance in consideration of shift and ligation, larger than that of the templates designating [GC] or [AT]. Further, it is found that the number of templates having the same MD value is larger than that of the templates designating [GC] or [AT]. The length L of the AG template is 3 or more, preferably 3 to 100, more preferably 3 to 32, and most preferably around 20, a length often used in experiments of molecular biology. If the length is 2 or less, an AG template having the desired Hamming distance cannot be obtained. Further, the predetermined value k is not particularly limited as long as it allows the oligonucleotide sequences constructed from the AG template to be oligonucleotide sequences of the present invention that can avoid mishybridization. The value is preferably one-fifth or more, more preferably one-fourth or more, and most preferably one-third or more of the length L of the AG template.
As in the case of GC templates, when the length L is increased or the MD value (k value) is decreased, many more AG templates exist; however, an AG template of predetermined length having the greatest k value (MD value) is particularly important. Examples of AG templates of length L=3 to 32 having the greatest k value (MD value) include AG templates of length L=3 to 5, 6 to 8, 9, 10 to 12, 13 and 14, 15 to 18, 19, 20 to 22, 23, 24 to 26, 27, 28 to 30, 31, and 32, for which the predetermined value k=1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, and 13, respectively. The maximum value of the predetermined value k for AG templates of length L=3 to 30, the number of AG templates having the maximum value, and specific examples are shown in [Table 7]. The number of AG templates in [Table 7] counts all templates, without omitting templates that become identical by cyclic shift or reversal.
The case using AG templates and the case using GC templates have much in common; for example, in both cases it is preferable that the set S of oligonucleotide sequences is a set of oligonucleotide sequences that contains or never contains particular subsequences such as restriction sites. Though templates designating [AG] or [TC] have the advantage that they can maintain a larger Hamming distance than templates designating [GC] or [AT], the number of codewords of constant-weight codes is generally not large. Therefore, from the viewpoint of the number of words that can be designed, GC templates are more flexible and have wider application. Further, GC templates have the great advantage that the melting temperature calculated by the nearest neighbor method used in biological experiments can be standardized, because not only the GC content but also the alignment of GC bases can be standardized in all sequences. Therefore, AG templates can also be treated as one possible variation.
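The role of constant-weight codes with AG templates can be illustrated by the following sketch. It is not the construction of reference BSS90: the code is built by a simple greedy search, which generally yields fewer codewords than the best known codes. The template value, the bit conventions (template bit 1 for [AG] and 0 for [TC], codeword bit 1 for [GC] and 0 for [AT]) and all function names are assumptions made only for this illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def greedy_constant_weight_code(n: int, w: int, d: int) -> list[str]:
    """Greedily collect length-n words of weight w with pairwise distance >= d."""
    code: list[str] = []
    for ones in combinations(range(n), w):
        word = "".join("1" if i in ones else "0" for i in range(n))
        if all(hamming(word, c) >= d for c in code):
            code.append(word)
    return code


# (template bit, codeword bit) -> base, under the assumed conventions.
AG_BASE = {("1", "1"): "G", ("1", "0"): "A", ("0", "1"): "C", ("0", "0"): "T"}


def to_dna(template: str, codeword: str) -> str:
    return "".join(AG_BASE[(t, c)] for t, c in zip(template, codeword))


template = "0010111"  # hypothetical AG template, not taken from [Table 7]
code = greedy_constant_weight_code(n=7, w=3, d=4)
sequences = [to_dna(template, c) for c in code]
for s in sequences:
    print(s, sum(b in "GC" for b in s))
```

Because every codeword has the same number of 1s, every sequence has the same GC content (here 3/7), whereas an arbitrary code combined with an AG template would give varying GC content.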
The set S of oligonucleotide sequences of the present invention can be advantageously used for DNA or RNA chips, or DNA or RNA tags, because the orthogonalization between sequences makes mishybridization between them difficult even if more than one kind of oligonucleotide chain is fixed on a substrate at high density. In addition, the set S of oligonucleotide sequences of the present invention is useful as primers for PCR, etc., because mishybridization with complementary sequences is difficult as well. Further, the set S of oligonucleotide sequences of the present invention can be advantageously used for a DNA computing system that comprises the steps of artificially synthesizing DNA sequences in which various symbol manipulation operating systems such as logical expressions and graph structures are recorded, and cutting and pasting the sequences according to protocols of molecular biological experiments, wherein the sequences obtained at the end of the experiments are the “calculation results” of the DNA computing, because the sequences not only are difficult to mishybridize with each other but also have specific sequence portions such as restriction sites.
The method for designing the set S of oligonucleotide sequences of the present invention makes it possible to efficiently and systematically design DNA sequences that are difficult to mishybridize with each other owing to the orthogonality of the sequences. Therefore, in biotechnology in general wherein information is written in DNA, the design method of the present invention is an essential technique for reducing experimental errors due to mishybridization of DNA. In addition, sequences for which the number of mismatches is guaranteed can be systematically constructed by combining a set of GC templates obtained by the method for designing a GC template of the present invention with codewords of arbitrary error correcting codes. Further, as the method for designing the set S of oligonucleotide sequences of the present invention fixes the sites where GC or AT bases appear, there are the following advantages.
(1) As the GC content of the sequences can be standardized, the physical properties (in particular, the melting temperature) of the sequences can be easily adjusted.
(2) By searching for GC templates that match a sequence pattern, particular subsequences such as restriction sites can be introduced beforehand (arbitrary subsequences can be incorporated into a designated sequence portion by making that portion correspond to the information bits of the error correcting code).
(3) More than one GC template can be combined and used, as long as the MD value does not decrease when the GC templates are ligated to each other.
Number | Date | Country | Kind |
---|---|---|---|
2001-331732 | Oct 2001 | JP | national |

Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP02/11163 | 10/28/2002 | WO |