The present invention relates to a method for designing a DNA code that can serve as a simple, general information carrier for writing information into biopolymers and that avoids the errors arising when artificially designed DNA is used as an information carrier; to a DNA code obtained by the design method; and to a technique for writing optional information into DNA by embedding the DNA codewords into an optional noncoding region not including any genetic information.
DNA has a structure in which four types of bases, namely adenine (A), cytosine (C), guanine (G) and thymine (T), are linked together in a strand. Since A pairs with T and C pairs with G through hydrogen bonds, A-T and C-G are considered complementary. Two complementary DNA strands form a double helix; the double helix separates into single strands when the temperature rises, and the single strands bind to their complementary strands again when the temperature drops. This process of binding to complementary strands is called hybridization, and it is well known that the temperature at which DNA strands separate or hybridize depends on the GC content of the sequence. Further, a noncomplementary base pair in a double strand cannot form stable hydrogen bonds and is called a (base) mismatch. The stability (e.g. free energy) of a DNA double helix depends on the number and distribution of base mismatches (see e.g. Biochemistry 37, 26, 9435-9444, 1998). To write information using DNA, plural oligonucleotide sequences corresponding to letters are prepared. A set of artificial oligonucleotide sequences of fixed length is used in many fields of application, as set forth below.
For instance, as biotechnology advances, artificial gene engineering is performed routinely, and protecting the copyright of modified genes has gained importance. However, a gene has no distinguishing feature other than being a combination of four bases, and no method has yet been established for characterizing cells, gene fragments, or the like newly generated by gene engineering so as to protect them from abuse. In order to limit use or piracy unintended by the developers, a DNA signature or DNA steganography (an externally invisible signature, achieved by hiding the signature in other information) is regarded as useful. It is realized by, for instance, expressing signature information as a DNA base sequence that identifies the origin of the DNA, and incorporating that base sequence into the artificially modified genome (see, e.g. Japanese Laid-Open Patent Application No.2001-352980). In practical use, oligonucleotide sequences of fixed length are artificially designed and used as signature sequences.
In addition, there is a relatively new field called “DNA computation”, which represents a computing paradigm unlike conventional computation (see e.g. Science 266, 5187, 1021-1024, 1994). In this field of study, symbol processing is realized by denoting logical variables or graph components as DNA base sequences in order to solve mathematical problems, and by applying experimental methods of molecular biology to the base sequences. A set of artificially designed oligonucleotide sequences of fixed length is used here, too.
Moreover, the DNA tag/antitag system (see, e.g. Proceedings of the National Academy of Sciences of USA 89, 12, 5381-5383, 1992, Proceedings of the National Academy of Sciences of USA 97, 4, 1665-1670, 2000, and Journal of Computational Biology 7, 3-4, 503-519, 2000) is used for monitoring gene expression with short oligonucleotide tags of fixed length. These tags can be regarded as codes denoting information corresponding to respective genes. Apart from this system, a method for using DNA as a future medium for data storage (see, e.g. 10th Foresight Conference on Molecular Nanotechnology (Bethesda, USA) Poster abstract, 2002) has also been advocated. Oligonucleotide sequences of fixed length are used for denoting respective data in these approaches, too.
All of the above techniques aim to write information into base sequences and require the design of “DNA codes”. Here, a DNA code is a set of base sequences that differ from each other but have the same length. The constraints that designed DNA codes should satisfy are as follows: all codewords (base sequences) must have uniform physical properties such as melting temperature, and they must not induce unwanted hybridization (mishybridization) between codewords. The design method therefore has much in common with the design of classical error-correcting codes. However, the design of DNA codes differs from that of error-correcting codes in some respects, and there is no standard method for designing codewords. Three basic approaches conventionally used for the design of DNA codewords are described below: (1) the template-map strategy, (2) De Bruijn construction, and (3) the stochastic method.
(Template-Map Strategy)
This design method was first proposed by Condon's group (see, e.g. Nucleic Acids Research 25, 23, 4748-4757, 1997). The basic idea is to divide the constraints on the DNA code, assign them separately to two binary codes, and combine the two to constitute a quaternary code (a DNA code). For instance, one binary code (called a template) that keeps the GC content constant and another binary code (called a map) that ensures mismatches between any codewords are combined to design a quaternary code fulfilling both constraints. Frutos et al. designed a DNA code of 108 words of length 8 with the following features: (1) each codeword has four GCs, and (2) there are at least four mismatches between any pair of codewords, including complementary sequences (see, e.g. Nucleic Acids Research 25, 23, 4748-4757, 1997). Further, Li et al. used the Hadamard code to generalize this design method to longer DNA codes (see, e.g. Langmuir 18, 3, 805-812, 2002). They presented, as an example, the design of a DNA code of 528 words of length 12 with a minimum of six mismatches.
Since a DNA code is produced by combining two binary codes in the template-map strategy, a DNA code designed with this technique can only fulfill properties that have conventionally been studied for binary codes. However, DNA, unlike codes used electronically, cannot mark the boundaries (commas) between codewords; therefore, a mechanism is needed that reliably detects a shift of the reading frame of a codeword. This property is referred to as comma-freeness, since no comma is needed. A code in which any frame-shifted word, read across the concatenation of two codewords, has at least d mismatches with every codeword is referred to as a comma-free code of index d. Unfortunately, the theory of comma-free codes of high index has seldom been studied for binary codes (see, e.g. IEEE Transactions on Information Theory, IT-11, 107-112, 1965, and Stiffler, J. J., Theory of Synchronous Communication. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1971). Therefore, comma-freeness cannot be conferred on DNA codes by the template-map strategy.
(De Bruijn Construction)
The longer a consecutive run of matched base pairs, the higher the risk of mishybridization. Accordingly, it is necessary to impose a constraint (a subword constraint) that forbids consecutive base matches of length k (k: generally 7 to 8). Ben-Dor et al. presented an optimal algorithm for choosing oligonucleotide tags that satisfy the subword constraint of length k, by cutting sequences of length k that share the same melting temperature out of a De Bruijn sequence of order k (see, e.g. Journal of Computational Biology 7, 3-4, 503-519, 2000). A De Bruijn sequence of order k is a circular sequence in which each sequence of length k occurs exactly once; its length is 2^k over a binary alphabet and 4^k over the four DNA bases. A linear time algorithm for the construction of a De Bruijn sequence is known.
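As a non-limiting illustration of the De Bruijn property described above, the following Python sketch builds a cyclic De Bruijn sequence with the standard Lyndon-word concatenation procedure and lists its cyclic windows of length k. It is not the tag-selection algorithm of Ben-Dor et al.; the function names `de_bruijn` and `cyclic_windows` are chosen here for illustration only, and the melting-temperature filtering step is omitted.

```python
def de_bruijn(alphabet: str, k: int) -> str:
    """Return a cyclic De Bruijn sequence of order k over `alphabet`.

    Uses the standard construction that concatenates Lyndon words whose
    length divides k, in lexicographic order.
    """
    n = len(alphabet)
    a = [0] * (n * k)
    out = []

    def db(t: int, p: int) -> None:
        if t > k:
            if k % p == 0:
                out.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in out)


def cyclic_windows(seq: str, k: int) -> list[str]:
    """List every length-k window of `seq`, treating `seq` as circular."""
    doubled = seq + seq[:k - 1]
    return [doubled[i:i + k] for i in range(len(seq))]


if __name__ == "__main__":
    k = 3
    seq = de_bruijn("ACGT", k)
    windows = cyclic_windows(seq, k)
    # Every length-k word over {A, C, G, T} should occur exactly once.
    print(len(seq), len(set(windows)))
```

Because every window of length k occurs only once around the circle, two tags cut from different positions cannot share a common subword of length k, which is the subword constraint referred to above.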
There are other similar techniques using a De Bruijn sequence and DNA chips using the tags constructed in this manner are commercially available (see, e.g. European Patent No.97302313 and Genome Research 10, 6, 853-860, 2000).
An oligonucleotide sequence chosen from the De Bruijn sequence of order k has no consecutive match of length k or longer; therefore, a DNA codeword of length 2k or longer can avoid a complete match between the concatenation of codewords and any other codeword (a comma-free code of index 1). In fact, Brenner applied a comma-free code of index 1 to the design of oligonucleotide tags (see, e.g. U.S. Pat. No. 5,604,097, Proceedings of the National Academy of Sciences of USA 89, 12, 5381-5383, 1992, and Proceedings of the National Academy of Sciences of USA 97, 4, 1665-1670, 2000). However, it is difficult to achieve a comma-free index of 2 or more when the De Bruijn sequence is used. Further, it is also difficult to guarantee the number of mismatches between codewords designed with the De Bruijn sequence. Therefore, it is highly difficult to design DNA codes having both a high comma-free index and a large number of mismatches between codewords.
(Stochastic Method)
The stochastic method is the most widely used approach in code design. Deaton et al. used genetic algorithms to find codewords sharing similar melting temperatures as well as satisfying the ‘extended’ Hamming constraint, i.e. a constraint where mismatches in the case of shift are also considered (see, e.g. DNA Based Computers II, DIMACS Series in Discrete Mathematics and Theoretical Computer Science 44, 247-258, 1998). According to their report, due to the complexity of the problem, genetic algorithms can only be applied to design of the codewords of up to length 25 (see, e.g. Proceedings of the 3rd Annual Genetic Programming Conference, Morgan Kaufmann 684-690, 1998).
Landweber et al. used a random codeword-generation program to design two sets of 10 codewords of length 15. The sequences thus designed satisfy the following conditions: (1) no more than five consecutive base matches in the ligation of any codewords, (2) standardized melting temperatures of 45° C., (3) avoidance of secondary structures, and (4) no consecutive combinations of more than seven base pairs (the fourth condition is unnecessary when the first is satisfied; the conditions are listed here as they appear in the original text). They realized these constraints with only three types of bases (see, e.g. Proceedings of the National Academy of Sciences of USA 97, 4, 1385-1389, 2000). Other groups who designed codewords with only three types of bases likewise employed random codeword-generation for design (see, e.g. DNA Computing: 6th International Workshop on DNA-Based Computers (DNA 2000; Leiden, The Netherlands), LNCS 2054, 17-26, 2001, and Science 296, 5567, 499-502, 2002).
Although no theoretical analysis of the algorithms used in the stochastic method has yet been performed, the power of the technique is evident in the work of Tulpan et al. (see, e.g. Proceedings of 8th International Meeting on DNA-Based Computers (DNA 2002; Sapporo, Japan), 311-323, 2002). By using the stochastic method, they could increase the number of codewords designed by the template-map strategy, whereas the stochastic method alone failed to outperform the template-map strategy. Therefore, the stochastic method is preferably applied to increasing the number of already designed codewords. The defects of the stochastic method are as follows: the designed codewords differ every time they are designed (since the method is stochastic), the number of codewords that can be designed cannot be predicted, and the features (e.g. the number of mismatches) of the codewords to be designed cannot be predicted in advance.
The conventional design methods set forth above all have defects, so none of them is an ideal design method. Ideal codewords should satisfy the various constraints described below.
(Hamming Distance Constraints)
Designed DNA codes should keep a large Hamming distance between all codewords. What makes DNA code design more complicated than the theory of error-correcting codes is that the number of mismatches must be considered in hybridization not only with the codewords but also with their complementary sequences.
(Comma-Free Constraints)
Comma-freeness is a property which guarantees the predetermined number of mismatches not only when the reading frames of the codewords are aligned but also when the reading frame is shifted. Since DNA does not have a fixed reading frame, it is desirable that the designed code is comma-free. By definition, a code is comma-free of index d when, for any two codewords x_1 x_2 ... x_n and y_1 y_2 ... y_n (not necessarily different), every word x_{r+1} x_{r+2} ... x_n y_1 y_2 ... y_r (0<r<n) read across their concatenation has d or more mismatches with every codeword (see, e.g. Canadian Journal of Mathematics 10, 202-209, 1958, and Canadian Journal of Mathematics 39, 3, 513-526, 1987). Thus, DNA codewords should be comma-free of high index. Here, it should be noted that a lack of comma-freeness is not compensated for by introducing ‘spacer’ codewords between codewords. The presence of spacers may facilitate decoding of codewords, but it does not contribute to the avoidance of mishybridization. Moreover, spacers lower the information content, as they introduce excess DNA sequence between codewords.
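By way of illustration only, the comma-free index defined above can be computed by brute force for a small code. The following Python sketch assumes that all codewords have the same length n and considers only the codewords themselves (not their complementary sequences); the function names are hypothetical.

```python
from itertools import product


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))


def comma_free_index(code: list[str]) -> int:
    """Largest d such that `code` is comma-free of index d.

    For every ordered pair of codewords x, y (not necessarily different)
    and every shift 0 < r < n, the frame-shifted word x[r:] + y[:r] is
    compared with every codeword; the minimum mismatch count is returned.
    """
    n = len(code[0])
    best = n
    for x, y in product(code, repeat=2):
        for r in range(1, n):
            shifted = x[r:] + y[:r]
            for z in code:
                best = min(best, hamming(shifted, z))
    return best


if __name__ == "__main__":
    print(comma_free_index(["AAC"]))         # single codeword, shifts differ
    print(comma_free_index(["ACG", "CGA"]))  # CGA appears inside ACGACG
```

An index of 0 means that some frame-shifted reading coincides exactly with a codeword, so a reading-frame shift could go undetected.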
(Energy Constraints)
In addition to the above constraints on mismatches, the melting temperatures of the codewords in a DNA code need to be standardized to guarantee unbiased behavior in experiments. There are several formulas to estimate the melting temperature: (1) for very short oligonucleotides, the GC content or the 2-4 rule (in the 2-4 rule, melting temperature is estimated as (the number of AT base pairs)×2+(the number of GC base pairs)×4° C.), (2) for relatively short oligonucleotides, the approximation using the nearest neighbor base pair method (see, e.g. Proceedings of the National Academy of Sciences of USA 83, 11, 3746-3750, 1986 and Biochemistry 37, 26, 9435-9444, 1998), and (3) for longer oligonucleotides, Wetmur's approximation (see, e.g. Critical Reviews in Biochemistry and Molecular Biology 26, 3-4, 227-259, 1991). Using one of these formulas, all codewords can be designed so that their melting temperatures are within a narrow range.
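As a simple illustration of the first of these formulas, the 2-4 rule can be evaluated as in the following Python sketch. The function names `tm_2_4_rule` and `uniform_tm` are hypothetical; the nearest neighbor method and Wetmur's approximation require thermodynamic parameters and are not shown.

```python
def tm_2_4_rule(seq: str) -> int:
    """Estimate the melting temperature (deg C) of a very short oligonucleotide.

    2-4 rule: 2 deg C per A/T base pair plus 4 deg C per G/C base pair.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("sequence must contain only A, C, G and T")
    return 2 * at + 4 * gc


def uniform_tm(codewords: list[str]) -> bool:
    """True when all codewords share the same 2-4 rule estimate."""
    return len({tm_2_4_rule(w) for w in codewords}) <= 1


if __name__ == "__main__":
    print(tm_2_4_rule("ACGTACGT"))             # 4 A/T pairs and 4 G/C pairs
    print(uniform_tm(["ACGTACGT", "GCATGCAT"]))
```

Under the 2-4 rule, codewords of equal length and equal GC content always receive the same estimate, which is why fixing the positions of [GC] and [AT] standardizes this estimate.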
(Other Constraints)
Following constraints in terms of base mismatches, depending on the model used, are known.
As mentioned above, as bio- and nanotechnology advance, the demand for writing information into DNA increases. This field of application differs from conventional biotechnology in that it attempts to write artificial information into DNA. Although various design strategies for DNA codes have been proposed, their aim is not to provide a standard code (like the ASCII code) for using DNA as an information carrier. Presumably, this is because the constraints to be satisfied by DNA sequences depend on the field in which the respective strategy is used. A simple, versatile code is required when DNA is used as an information carrier.
When information is written or read in DNA, following phenomena should be taken into account.
Conventional DNA code techniques do not consider misreading, since their theory is built on the hypothesis that written information can be sequenced from DNA “in its entirety”. Further, they do not consider primers either, or merely propose a very ambiguous solution such as “preparing specific sequences at both ends of the information to be embedded into DNA”. In addition, the conventional methods show no specific means for writing information into DNA; accordingly, they indicate no techniques for standardizing physical properties or for preventing the appearance of specific sequences. There are a number of experimental constraints on the replication of genetic information, so even a high level of technology does not enable replication without any errors. Further, even if errors can be eliminated at the replication stage, mutation of the sequence by biomolecules or radiation should be considered when the information sequence is written into the DNA of a living body.
Therefore, the object of the present invention is to provide a method of designing a set of base sequences for codes (a set of symbols given meanings artificially, such as letters of the alphabet), used as information carriers to read or write optional information in optional noncoding regions not including any DNA genetic information, i.e., a method of designing DNA codes. The codewords of the DNA codes can correspond to the code system used by computers, and they are characterized in that any arrangement of the letters permits decoding of the codewords with very high reliability. These DNA codewords, having features utterly different from those of natural DNA, can be embedded into an optional area not including any DNA genetic information. Further, the DNA codewords prepared by the design method of the present invention can also be utilized as an information storage medium.
The inventor previously proposed: a method for systematically designing a set S1 of oligonucleotide sequences of predetermined length n (n is an integer, 3 or more, preferably, 6 or more), wherein each of oligonucleotide sequences in the set S1 induces equal to or more than a fixed number of mismatches against any of oligonucleotide sequences in the set S1, complementary sequences of each of oligonucleotide sequences in the set S1, sequences constructed by shifting these sequences, and sequences produced by ligation of these oligonucleotide sequences, of their complementary sequences, and of the oligonucleotide sequences and their complementary sequences, wherein the set S1 of oligonucleotide sequences can avoid mishybridization between any of said oligonucleotide sequences, said complementary sequences, sequences constructed by shifting these sequences, and sequences produced by ligation of said oligonucleotide sequences, of said complementary sequences, and of said oligonucleotide sequences and said complementary sequences; and a method for systematically designing a set S1 of oligonucleotide sequences which can avoid mishybridization for reverse sequences as well as for complementary sequences (Japanese Patent Application No. 2001-331732).
The present inventor has conducted an intensive study to solve the above-identified problem. Since the design of sequences for embedding information into DNA requires not only maintaining an error-correcting function but also keeping physical properties such as melting temperatures homogeneous, the inventor found a method for designing a DNA code satisfying all these conditions by the following steps: further selecting a template having a subword constraint of length m from the templates used in designing the above-mentioned set of oligonucleotide sequences by the present inventor, and combining it with codewords of predetermined error-correcting codes that also have a subword constraint of length m, to make a set S2 of base sequences which can be used as letters in describing information. The present inventor thereby realized a correspondence between a conventional code system, including ASCII, and a code system based on DNA base sequences. The present invention has thus been completed.
That is, the present invention provides:
a method for designing a DNA code, comprising the following steps: 1) selecting a binary string (GC template) such that all of its Hamming distance against its reverse sequence, its block shift, and the distance against the overlap part of its tandem concatenation, its concatenation with its reverse sequence, and the tandem concatenation of its reverse sequence are equal to or above the predetermined value k, and in the following, an oligonucleotide sequence of predetermined length n (n is an integer 6 or more) is specified by the binary string of 0 and 1 (GC template) of predetermined length L (L is an integer 6 or more), meaning that the position of G or C ([GC]), or A or T ([AT]) are fixed; 2) selecting a set having a subword constraint of length m as a template from the set of the selected GC templates; and 3) constructing a set S1 of the oligonucleotide sequences by combining codewords of the predetermined error-correcting codes having a subword constraint of length m likewise (“1”);
a method for designing a DNA code, comprising the following steps: 1) selecting a binary string (AG template) such that its Hamming distance against its reverse inverted sequence, its block shift, and the distance against the overlap part of its tandem concatenation, its concatenation with its reverse inverted sequence, and the tandem concatenation of its reverse inverted sequence are equal to or above the predetermined value k, and in the following, an oligonucleotide sequence of predetermined length n (n is an integer 6 or more) is specified by the binary string of 0 and 1 (AG template) of predetermined length L (L is an integer 6 or more), meaning that the position of A or G ([AG]), or T or C ([CT]) are fixed; 2) selecting a set having a subword constraint of length m as a template from the set of the selected AG templates; and 3) constructing a set S1 of oligonucleotide sequences by combining the codewords of predetermined error-correcting codes having a subword constraint of length m likewise (“2”);
a method for designing a DNA code, wherein any of oligonucleotide sequences of the set S1, of which Hamming distance is kept equal to or above k, induces mismatches equal to or above the predetermined value against any of the sequences, their complementary sequences, sequences constructed by shifting these sequences, and sequences produced by ligation of sequences in the set S1, of their complementary sequences, and of the sequences and their complementary sequences, and wherein the sequence in the set S1 can avoid mishybridization between them, their complementary sequences, sequences constructed by shifting these sequences, and sequences produced by ligation of the sequences in the set S1, of their complementary sequences, and of the sequences and their complementary sequences, which facilitates decoding information (“3”);
a method for designing a DNA code, wherein the set S1 of oligonucleotide sequences of predetermined length n is a set S1 of oligonucleotide sequences of length 32 or less (“4”);
a method for designing a DNA code, wherein the predetermined value k of Hamming distance is one-fourth of L or more (“5”);
a method for designing a DNA code, wherein the subword constraint of length m is half of L or more (“6”);
a method for designing a DNA code, wherein the set S1 of oligonucleotide sequences is a set of oligonucleotide sequences that contains or never contains a particular subsequence (“7”);
a method for designing a DNA code, wherein the codewords of the predetermined error-correcting code are selected from Hamming codes, BCH codes, maximum-length codes, Golay codes, Reed-Muller codes, Reed-Solomon codes, Hadamard codes, Preparata codes, reversible codes, constant-weight codes, or nonlinear codes (“8”); and
a method for designing a DNA code, wherein a set of base sequences corresponding to a symbolic unit has a sequence unlike that of natural DNA, and has a constant alignment of [GC][AT] or [CT][AG] (“9”).
Further, the present invention provides a DNA code consisting of a set of base sequences corresponding to a symbolic unit, which can write optional information into an optional noncoding region not including any DNA genetic information by using a code system decoded by computer (“10”); a DNA code having a constant alignment of [GC][AT] or [CT][AG], and consisting of a set of base sequences designed so that their melting temperatures are standardized in the same predetermined range (“11”); a DNA code consisting of a set of base sequences in which an error such as skip or substitution of some bases is easily detected (“12”); a DNA code comprising an error-correcting function which can decrypt (decode) with high reliability even in the presence of an error such as shift of a reading frame of a base sequence corresponding to a symbolic unit or substitution of plural bases (“13”); a DNA code which does not form a stable secondary structure with base sequences corresponding to a symbolic unit, wherein physical inhibition to inhibit amplification by a primer does not occur in any ligation of letters (“14”); a DNA code consisting of a set of base sequences corresponding to a symbolic unit, which is easily distinguished from natural DNA (“15”); a DNA code, wherein a base alignment is limited in a base sequence, with which whether a specific subsequence appears or not is easily examined (“16”); a DNA code consisting of 112 codewords of length 12, showing mismatches at least at four positions in any hybridization, having at most six consecutive subsequences, and maintaining the same melting temperature in the approximation using the nearest neighbor method (“17”); a DNA code which can be obtained according to any one of the methods for designing described in above (“18”); and a method for writing optional information into DNA, wherein the DNA code is embedded into an optional noncoding region not including any DNA genetic information (“19”).
The present invention still further provides: a method for writing optional information into DNA, wherein the DNA is a vector DNA (“20”); a method for writing optional information into DNA, wherein the DNA is a genomic DNA (“21”); a method for writing optional information into DNA, wherein a DNA creator can be identified by the DNA code (“22”); a labeled vector wherein the DNA codes are embedded into an optional noncoding region not including any DNA genetic information (“23”); a labeled cell, wherein the DNA codes are embedded into an optional noncoding region not including any DNA genetic information (“24”); and a DNA tag having the DNA codes (“25”).
According to the present invention, DNA codes having following features can be designed.
The method for designing a DNA code of the present invention is not particularly limited as long as it is a method for constructing a set S1 of oligonucleotide sequences corresponding to a signal unit in signaling, comprising the following steps: 1) selecting a binary string (GC template) such that its Hamming distance against its reverse sequence, its block shift, and the distance against the overlap part of its tandem concatenation, its concatenation with its reverse sequence, and the tandem concatenation of its reverse sequence are equal to or above the predetermined value k, and in the following, an oligonucleotide sequence of predetermined length n (n is an integer 6 or more) is specified by the binary string of 0 and 1 (GC template) of predetermined length L (L is an integer 6 or more), meaning that the position of G or C ([GC]), or A or T ([AT]) are fixed; 2) selecting a set having a subword constraint of length m as a template from the set of the selected GC templates; and 3) combining codewords of the predetermined error-correcting codes having a subword constraint of length m likewise; or comprising the following steps: 1) selecting a binary string (AG template) such that its Hamming distance against its reverse inverted sequence, its block shift, and the distance against the overlap part of its tandem concatenation, its concatenation with its reverse inverted sequence, and the tandem concatenation of its reverse inverted sequence are equal to or above the predetermined value k, and in the following, an oligonucleotide sequence of length n (n is an integer 6 or more) is specified by the binary string of 0 and 1 (AG template) of predetermined length L (L is an integer 6 or more), meaning that the position of A or G ([AG]), or T or C ([CT]) are fixed; 2) selecting a set having a subword constraint of length m as a template from the set of the selected AG templates; and 3) combining the codewords of predetermined error-correcting codes having a subword constraint of length m likewise.
DNA sequences and RNA sequences are both included in the above oligonucleotide sequences; for the sake of convenience, “a method for designing an RNA code as an information carrier” is also included in the above “a method for designing a DNA code as an information carrier”. Meanwhile, in the present invention, encoding means relating a specific base sequence to letters or symbols in order to process the letters or symbols by computer, while a DNA code refers to a set of signal units (letters such as those of the alphabet, which may be called DNA codewords) represented using DNA as a medium. The DNA code which can be obtained by the design method of the present invention can be advantageously used when optional information is written into an optional noncoding region, such as an intron, a 5′-noncoding region, or a 3′-noncoding region, not including any DNA genetic information.
There is no particular upper limit on the predetermined length n (n is an integer 6 or more) of the above oligonucleotide sequences, but it is generally 100 bases, preferably 32 bases; for the sake of convenience, subsets of the set S1 are also included in the set S1 of the above oligonucleotide sequences. Hereinafter, it is described how DNA codes consisting of a set of base sequences corresponding to a signal unit, such as a letter of the alphabet, are designed from the mismatch-inducing set S1, mainly with the use of a GC template and focusing on the case where the oligonucleotide sequence is a DNA sequence, including the case of complementary sequences.
The P sequences in the above set S1 designed by using a template not only induce mismatches of predetermined value or more between the sequences themselves, and between the P sequences and other P sequences in the set S1, in both cases where sequences are shifted (sequences are staggered) and not shifted and can avoid mishybridization, but also induce mismatches of predetermined value or more between the P sequences and PC sequences which are complementary sequences of each of other oligonucleotide sequences (excluding the P sequences themselves) in the set S1, that is, PC sequences constructed by substituting T, A, C and G for A, T, G and C in the P sequences respectively, and reversing the direction of 5′ and 3′, in both cases where sequences are shifted and not shifted, and can avoid mishybridization. The P sequences further induce mismatches of predetermined value or more between the P sequences and oligonucleotide sequences constructed by ligating each of oligonucleotide sequences in the set S1, that is, ligated sequences of P sequences, and ligated sequences of PC sequences, ligated sequences of P sequences and PC sequences, ligated sequences of PC sequences and P sequences, etc., and can avoid mishybridization. Here, mismatch means a pairing with bases other than complementary bases in hybridization, and as mismatches of predetermined value or more, there is no particular limitation as long as it is the number of mismatches with which mishybridization can be avoided, however, it is preferable that mismatches are one-fifth or more, more preferably one-fourth or more, and most preferably one-third or more of predetermined length n (n is an integer 6 or more) of oligonucleotide sequences.
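For illustration only, the construction of a PC sequence (substituting T, A, C and G for A, T, G and C and reversing the 5′ to 3′ direction) and the counting of mismatches against the PC sequences of other codewords can be sketched in Python as follows. The sketch covers only the unshifted case and does not treat ligated sequences; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")


def reverse_complement(seq: str) -> str:
    """Complementary (PC) sequence: swap A<->T and G<->C, then reverse."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(x: str, y: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))


def min_mismatch_to_complements(code: list[str]) -> int:
    """Smallest mismatch count between any codeword P and the PC sequence
    of any *other* codeword in the set (unshifted alignment only)."""
    return min(
        hamming(p, reverse_complement(q))
        for i, p in enumerate(code)
        for j, q in enumerate(code)
        if i != j
    )


if __name__ == "__main__":
    print(reverse_complement("AACG"))
    print(min_mismatch_to_complements(["AAAA", "CCCC"]))
    print(min_mismatch_to_complements(["AAAA", "TTTT"]))
```

A result of 0, as for the pair AAAA and TTTT, indicates that one codeword is identical to the complementary sequence of another; such a set would not satisfy the mismatch requirement described above.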
Further, it is preferable that the oligonucleotide sequence consisting of the above set S1 can be processed as a set of sequences with which it is possible to easily locate the position where a particular subsequence appears. Examples of the particular subsequences include restriction sites; expression signal sequences including poly A portions of RNA, ATG which is a translation initiation codon, TAA, TAG, TGA, etc. which are stop codons; consensus sequences GCCAATCT, ATGCAAAT, recognized by transcription factors, and optional DNA sequence signal such as base sequences encoding variable regions of antibodies.
The afore-mentioned set S1 of oligonucleotide sequences can usually be designed in two steps. In the first step, a GC template is designed with the use of the Hamming distance; in the next step, the set S1 of oligonucleotide sequences of the present invention is designed from the set of oligonucleotide sequences represented by the designed GC template by using the theory of error-correcting codes. The first step determines whether each position in the sequences is [GC] or [AT]. These positions are represented by a GC template comprising 0 and 1, b1 b2 . . . bL (bi ∈ {0, 1}), in which 1 and 0 mean [AT] and [GC], respectively, or 1 and 0 mean [GC] and [AT], respectively. Therefore, a GC template of length L represents not 4^L but 2^L kinds of sequences. In the next step, base sequences are determined according to the GC template by specifically substituting bases [AT] for the positions of 1 and bases [GC] for the positions of 0, or bases [GC] for the positions of 1 and bases [AT] for the positions of 0.
The Hamming distance mentioned above is used as a measure of similarity between sequences. For example, the Hamming distance between two strings x=x1, x2, . . . xn and y=y1, y2, . . . yn is defined as the number of indices i that satisfy the condition xi≠yi. In addition, as mishybridization between DNA sequences can occur even when sequences are shifted (staggered), it is necessary to consider the Hamming distance in the case where sequences are shifted. Since a “shift” occurs when one sequence is longer than the other, in the case of |x|<|y| the Hamming distance between the two strings is taken to be the minimum of the Hamming distances between x and each of the (|y|−|x|+1) subsequences of length |x| contained in y. The Hamming distance given by this minimum value is represented by H(x, y).
Next, a function MD (abbreviation of minimum distance) of a GC template t is considered in order to obtain the Hamming distance between the GC template t and ligated sequences of the GC template t, ligated sequences of the reverse sequence tR of the GC template t, and ligated sequences of the GC template t and its reverse sequence tR. The above-mentioned reverse sequence tR of a GC template means the sequence in which the binary string of the GC template t is written in reverse order. As the Hamming distance between the GC template t and each of the GC template t and its reverse sequence tR, which are the sequences at both outer sides of the ligated sequences, is already obtained, it suffices to consider the sequences in which one letter each is deleted from both ends of the ligated sequences when obtaining the minimum value of the Hamming distance by shifting the GC template t against the ligated sequences. Consequently, it is convenient to use a symbol [ ] in the mathematical formula of MD(t). The meaning of the symbol [ ] is [s1 s2 s3 . . . sm−1 sm]=s2 . . . sm−1, that is, it means the sequence in which one letter each is deleted from both ends. Therefore, the minimum distance MD(t) of the Hamming distance between the GC template t and the ligated sequences is represented by the following formula.
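The definition of H(x, y) can be sketched as follows. This is a minimal illustration under the definition given above; the function names hamming and shifted_hamming are hypothetical.

```python
# Illustrative sketch only; function names are hypothetical.
def hamming(x: str, y: str) -> int:
    """Number of positions i with x[i] != y[i]; both strings must have equal length."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))

def shifted_hamming(x: str, y: str) -> int:
    """H(x, y): minimum Hamming distance between the shorter string and each
    window of the same length in the longer string (|y| - |x| + 1 windows)."""
    if len(x) > len(y):
        x, y = y, x
    n = len(x)
    return min(hamming(x, y[i:i + n]) for i in range(len(y) - n + 1))

print(hamming("0011", "0101"))          # two positions differ
print(shifted_hamming("111", "10010"))  # best of the three windows
```

When the two strings have equal length there is only one window, so H(x, y) reduces to the ordinary Hamming distance.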
MD(t)=min{H(t, tR), H(t, [tt]), H(t, [ttR]), H(t, [tRt]), H(t, [tRtR])}.
Consequently, in case where MD(t)=k(k≧0) for a GC template t, at least Hamming distance k is ensured for sequences [tt], [ttR], [tRt], [tRtR], including ligating parts thereof, wherein one letter each is deleted from both ends of ligated sequences, when a GC template t is shifted against ligated sequences.
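The formula for MD(t) can be evaluated directly. The sketch below is an illustration that follows the definitions of H, tR and [ ] given above; the function names are hypothetical. The template used in the example is one of the length-12 templates quoted later in the description as having MD(t)=4.

```python
# Illustrative sketch only; function names are hypothetical.
def shifted_hamming(x: str, y: str) -> int:
    """H(x, y): minimum Hamming distance of x against every length-|x| window of y."""
    n = len(x)
    return min(sum(a != b for a, b in zip(x, y[i:i + n]))
               for i in range(len(y) - n + 1))

def inner(s: str) -> str:
    """The bracket operation [s]: delete one letter from each end."""
    return s[1:-1]

def md(t: str) -> int:
    """MD(t) = min{H(t,tR), H(t,[tt]), H(t,[ttR]), H(t,[tRt]), H(t,[tRtR])}."""
    r = t[::-1]  # reverse sequence tR
    return min(
        shifted_hamming(t, r),
        shifted_hamming(t, inner(t + t)),
        shifted_hamming(t, inner(t + r)),
        shifted_hamming(t, inner(r + t)),
        shifted_hamming(t, inner(r + r)),
    )

print(md("000110011101"))  # 4
```

Deleting one letter from each end of a ligated sequence of length 2L leaves L−1 windows, which are exactly the shifted alignments that span the ligating part.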
Thus, the method for designing a GC template mentioned above is used at the first step of constructing the set S1 of oligonucleotide sequences mentioned above. As seen from the above explanation, the method for designing a GC template is not particularly limited as long as it is a method comprising selection of GC templates such that the Hamming distance against its reverse sequence, against its block shift, and against the overlap parts of its tandem concatenation, its concatenation with its reverse sequence, and the tandem concatenation of its reverse sequence is equal to or above the predetermined value k. In the following, an oligonucleotide sequence of predetermined length n is specified by a binary string of 0 and 1 (GC template), meaning that the positions of [GC] or [AT] are fixed. The length L of the GC template is 6 or more, preferably 6 to 100, more preferably 6 to 32, and most preferably around 20, a length often used in experiments of molecular biology. If the length is 5 or less, a template having the desired Hamming distance cannot be obtained. By using the GC template having the length L, a set S1 of oligonucleotide sequences of corresponding length n can be obtained. Further, the predetermined value k is not particularly limited as long as it is a value that allows oligonucleotide sequences constructed from the GC template to be the oligonucleotide sequences of the present invention that can avoid mishybridization. The value is preferably one-fifth or more, more preferably one-fourth or more, and most preferably one-third or more of the length L of the GC template.
In general, when the length L is increased or MD value (k value) is decreased, many more GC templates will exist, however, a GC template of predetermined length and having the greatest k value (MD value) is particularly important. Examples of GC templates of length L=6 to 32 and having the greatest k value (MD value) include: GC templates having length L=6 to 10, 11 to 15, 16 to 18, 19, 20 to 22 and 24, 23 and 25, 26 and 27, 28 and 29, 30 to 32, and the predetermined value k=2, 4, 6, 7, 8, 9, 10, 11, 12, respectively. The maximum value of the predetermined value k in the GC templates of length L=6 to 32, the number of GC templates having the maximum value, and specific examples are shown in [Table 1]. In addition, the shortest GC templates that fulfill specific MD value (k value) are shown in [Table 2]. Further, specific examples for GC templates of length L=11 to 27 and those for GC templates of length L=28 to 30 are shown in [Table 3] and [Table 4], respectively. In [Table 2], GC templates are enumerated excluding the ones that have the same reverse sequences or sequences wherein 0 and 1 are reversed, and in [Table 3] and [Table 4], “items” are the numbers after omitting GC templates that become identical by cyclic shift.
The GC template sequences enumerated in [Table 1] to [Table 4], etc., can be selected by a person skilled in the art by exhaustively searching all patterns from the sequence comprising only 0 to the sequence comprising only 1. However, there is no need to search all 2^L patterns to find a GC template of length L. It suffices to take into account the GC templates in which the number of bits 1 is L/2 or less, because GC templates whose bits 0 and 1 are inverted have the same property. In addition, from the constraint on the number of mismatches, it is shown that where the minimum distance is d, the number of bits 1 is at least (L−sqrt(L^2−2dL))/2 (sqrt means square root). The GC templates can be obtained efficiently by additionally using these constraints. Further, when GC templates are designed such that the set S1 of oligonucleotide sequences constructed from the GC templates is a set of oligonucleotide sequences that contains, or never contains, particular subsequences such as the restriction sites mentioned above, such designing corresponds to narrowing the space for the exhaustive search, and therefore it contributes to easier designing.
Following the step of designing GC templates by using the Hamming distance mentioned above, the set S1 of oligonucleotide sequences mentioned above can be designed in the step in which the theory of error-correcting codes is applied to the set of oligonucleotide sequences represented by the designed GC templates, that is, by combining them with codewords of any error-correcting code. As for the codewords of error-correcting codes mentioned above, any codewords can be used as long as they are codewords of known error-correcting codes, and specific examples include Hamming codes, BCH codes, maximum-length codes, Golay codes, Reed-Muller codes, Reed-Solomon codes, Hadamard codes, Preparata codes, reversible codes, constant-weight codes, and nonlinear codes.
The motive for using the theory of error-correcting codes is to ensure mismatches to complementary sequences in the case where there is no shift. Therefore, as to the set S1 in consideration of reverse sequences, it is not always necessary to use error-correcting codes. An error-correcting code is a set of codewords in which there are at least a certain number of mismatches between any two codewords. To prevent mishybridization between a set S1 and the set of reverse sequences thereof, it is only necessary to apply a set of codewords in which there are at least a certain number of matches (not mismatches) between any two codewords. In the set S1 of oligonucleotide sequences mentioned above, the information of both the codewords and the GC templates is reflected in the sequences. Therefore, it suffices to use error-correcting codes maintaining a Hamming distance (number of mismatches) of k or more in order to ensure k mismatches to complementary sequences, and it suffices to use codes maintaining k or more matches in order to ensure k mismatches to reverse sequences.
In the theory of error-correcting codes, codes have been developed in which redundant bits for detecting and correcting errors, called parity bits, are added to given information bits to make the Hamming distance between any two codewords a certain value or above. The minimum value of the Hamming distance between codewords is called the minimum distance. As the object of code theory is to design codes that keep the minimum distance large and contain many codewords, there are many codes that meet the purpose of the present invention. For example, the Golay code of code length 23 and minimum distance 7 has 4096 codewords. With the use of this code, it is possible to design 4096 oligonucleotides for one GC template of length 23 (MD value is up to 9).
In order to prepare oligonucleotide sequences fulfilling stricter constraints for general-purpose DNA codes, a subword constraint of length m should also be considered when the templates used in the set S1 mentioned above are selected. When the set is selected, the binary strings of 0 and 1 are designed so that no common run of m or more consecutive bits is present between the templates constructing the set S1, and the distance between codewords is designed so that the binary strings do not match over m or more consecutive bits between codewords, by using the obvious transformation from error-correcting codewords to the Max Clique Problem. As for the m value in the subword constraint of length m, a value of 10 or less is preferable in that mismatches can be fully dispersed. When L is 12, 7 can be exemplified as the m value.
For instance, 001110010000, 001001010100, 000000000000, 010001110101, and 111010011000 (lower), as codewords of a nonlinear code of length L=12 having minimum distance 4 and a subword constraint of length 7, are combined with 000110011101 and 001010111100 (upper) of length L=12, having MD(t)=4 and a subword constraint of length 7, as templates in a set S1. The resulting base sequences induce at least four mismatches against any concatenations and shifts, and no run of 7 or more consecutive bases inducing no mismatch is present. For instance, when 00 is A, 01 is T, 10 is G, and 11 is C, a set of ten DNA sequences consisting of 12 bases, shown in Table 5, whose GC content is ½, is obtained. Further, when 00 is G, 01 is C, 10 is A, and 11 is T, a set of ten DNA sequences consisting of 12 bases, shown in Table 6, whose GC content is ½, is obtained.
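The exhaustive search restricted to templates with at most L/2 bits 1 can be sketched as below. This is a minimal illustration, not the procedure actually used to produce the tables; the names md, min_ones and best_templates are hypothetical, and min_ones merely evaluates the lower bound quoted above (it is shown for reference and is not used for pruning here).

```python
# Illustrative sketch only; function names are hypothetical.
import math
from itertools import combinations

def shifted_hamming(x, y):
    n = len(x)
    return min(sum(a != b for a, b in zip(x, y[i:i + n]))
               for i in range(len(y) - n + 1))

def md(t):
    """MD(t) over tR and the four ligated sequences with both end letters deleted."""
    r = t[::-1]
    ligated = [t + t, t + r, r + t, r + r]
    return min([shifted_hamming(t, r)] +
               [shifted_hamming(t, s[1:-1]) for s in ligated])

def min_ones(L, d):
    """Lower bound (L - sqrt(L^2 - 2dL)) / 2 on the number of bits 1, rounded up.
    Requires 2d <= L so that the square root is real."""
    if 2 * d > L:
        raise ValueError("bound is defined only for 2d <= L")
    return math.ceil((L - math.sqrt(L * L - 2 * d * L)) / 2 - 1e-9)

def best_templates(L, max_weight=None):
    """Return (greatest MD value, templates attaining it) among templates of
    length L with at most max_weight bits 1 (default L // 2)."""
    if max_weight is None:
        max_weight = L // 2
    best_k, best = -1, []
    for w in range(max_weight + 1):
        for ones in combinations(range(L), w):
            t = "".join("1" if i in ones else "0" for i in range(L))
            k = md(t)
            if k > best_k:
                best_k, best = k, [t]
            elif k == best_k:
                best.append(t)
    return best_k, best

k, found = best_templates(6)
print(k, len(found), min_ones(12, 4))
```

Because inverting 0 and 1 in a template inverts every ligated sequence in the same way, all Hamming distances are preserved, so restricting the search to weight L/2 or less cannot lower the greatest MD value that is found.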
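The notions of parity bits and minimum distance can be illustrated with a small code. The sketch below uses the (7,4) Hamming code as a stand-in because it is small enough to enumerate; it is not the Golay code mentioned above, and the generator matrix G and the function names are choices made for this illustration.

```python
# Illustrative sketch only; the (7,4) Hamming code is used as a small stand-in.
from itertools import combinations, product

# Systematic generator matrix: 4 information bits followed by 3 parity bits.
G = [
    (1, 0, 0, 0, 1, 1, 0),
    (0, 1, 0, 0, 1, 0, 1),
    (0, 0, 1, 0, 0, 1, 1),
    (0, 0, 0, 1, 1, 1, 1),
]

def encode(message):
    """Multiply a 4-bit message by G over GF(2) to obtain a 7-bit codeword."""
    return tuple(sum(m * g for m, g in zip(message, column)) % 2
                 for column in zip(*G))

def minimum_distance(code):
    """Smallest Hamming distance between any two distinct codewords."""
    return min(sum(a != b for a, b in zip(u, v))
               for u, v in combinations(code, 2))

codewords = [encode(m) for m in product((0, 1), repeat=4)]
print(len(codewords), minimum_distance(codewords))  # 16 codewords, distance 3
```

The same enumeration applied to a code with 12 information bits would give 2^12 = 4096 codewords, which is the count stated above for the Golay code of length 23.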
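The combination of templates (upper bit) and codewords (lower bit) into bases can be sketched as follows. The templates, codewords and both base assignments are the ones quoted above; the function and constant names are hypothetical, and the reading of each two-bit label as (template bit, codeword bit) is an assumption based on the "upper" and "lower" designations. The printed sequences are computed by this sketch and have not been checked against Table 5 or Table 6.

```python
# Illustrative sketch only; names are hypothetical.
TEMPLATES = ["000110011101", "001010111100"]            # upper bits
CODEWORDS = ["001110010000", "001001010100", "000000000000",
             "010001110101", "111010011000"]            # lower bits

# (template bit, codeword bit) -> base, for the two assignments in the text.
MAP_A = {("0", "0"): "A", ("0", "1"): "T", ("1", "0"): "G", ("1", "1"): "C"}
MAP_B = {("0", "0"): "G", ("0", "1"): "C", ("1", "0"): "A", ("1", "1"): "T"}

def to_dna(template: str, codeword: str, base_map=MAP_A) -> str:
    """Pair the bits position by position and translate each pair to a base."""
    if len(template) != len(codeword):
        raise ValueError("template and codeword must have equal length")
    return "".join(base_map[pair] for pair in zip(template, codeword))

def gc_content(seq: str) -> float:
    return sum(base in "GC" for base in seq) / len(seq)

sequences = [to_dna(t, c) for t in TEMPLATES for c in CODEWORDS]
print(sequences[0], gc_content(sequences[0]))  # AATCCAACGGAG 0.5
```

Since both templates contain six bits 1 and six bits 0, every sequence produced under either assignment has a GC content of ½, independent of the codeword.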
Next, the DNA code of the present invention is not particularly limited as long as it can write optional information into an optional noncoding region not including any DNA genetic information by using a code system decodable by computer, such as binary code, and consists of a set of encoded base sequences. However, the following are preferable: a DNA code consisting of a set of base sequences encoded so that not only the GC content but also the alignment of GC bases is the same and the melting temperatures estimated by approximation using the nearest neighbor method used in experiments of molecular biology are in a predetermined range; a DNA code consisting of a set of encoded base sequences in which an error such as a skip or substitution of some bases is easily detected; a DNA code comprising an error-correcting function which can decode with high reliability even in the presence of an error such as a shift of the reading frame of encoded base sequences or substitution of plural bases; a DNA code which does not form a stable secondary structure with encoded base sequences, so that physical inhibition of amplification by a primer does not occur in any ligation of codewords; a DNA code consisting of a set of encoded base sequences corresponding to letters, which can be easily distinguished from natural DNA; and a DNA code wherein the base alignment is limited and the appearance of a specific subsequence can be easily located. The DNA code can be obtained by the method for designing a DNA code of the present invention. A specific example is a DNA code consisting of 112 codewords of length 12 which induces mismatches at least at four positions between codewords in any ligation of codewords including their complementary sequences, allows at most 6 consecutive matches of bases, thereby preventing mishybridization, and further maintains the same melting temperature in approximation using the nearest neighbor method.
As for method for writing optional information by using the DNA of the present invention, it is not specifically limited as long as it is a method wherein the DNA code of the present invention mentioned above, consisting of a set of base sequences corresponding to letters such as alphabet, is embedded into an optional noncoding region such as intron, 5′-noncoding region, or 3′-noncoding region, not including any DNA genetic information. As for the DNA in which the DNA code of the present invention is embedded, a vector DNA such as a plasmid vector DNA and a viral vector DNA, and a genomic DNA of animal or plant cell and microbial cell can be exemplified. The method for writing optional information into the DNA of the present invention allows DNA signature by embedding DNA codes corresponding to letters such as alphabet with which the creator can be identified, into an optional noncoding region not including any DNA genetic information. The present invention also relates to a labeled vector or labeled cells in which the DNA code of the present invention is embedded in an optional noncoding region not including any DNA genetic information, and with which the creator can be identified.
Even when plural types of oligonucleotide strands consisting of the DNA codes of the present invention are fixed in high density on a substrate, the sequences rarely mishybridize with each other; consequently, the set of encoded base sequences of the present invention can be advantageously applied in a DNA chip or RNA chip, or as a DNA tag or RNA tag. Further, they rarely cause mishybridization with their complementary sequences, so the set of encoded base sequences of the present invention is useful as primers in PCR or the like. Moreover, since it can easily be proved that the set of encoded base sequences of the present invention does not contain particular subsequences such as restriction sites, in addition to rarely causing mishybridization, it can be advantageously used in a DNA computing system comprising the following steps: artificially synthesizing DNA sequences in which various symbol-manipulation systems such as logical expressions and graph structures are recorded, and cutting and pasting the sequences according to protocols of molecular biological experiments, wherein the sequences obtained at the end of the experiments are the “calculation results” of DNA computing.
The present invention is described below more specifically with reference to Example, however, the technical scope of the present invention is not limited to the following exemplification.
(DNA ASCII Code)
When the design of the ASCII code (128 letters) using DNA is considered, one DNA codeword is used for each of the letters such as the alphabet. One of the shorter error-correcting codes with at least 128 codewords is the nonlinear (12,144,4) code (Sloane, N. J. A. and MacWilliams, F. J.: The Theory of Error-Correcting Codes. Elsevier, 1997). The above notation (12,144,4) reads ‘a length-12 code of 144 words with the minimum distance 4’ (one error-correcting, two error-detecting). By using a Max Clique Problem solver (http://rtm.science.unitn.it/intertools/), 32, 56, and 104 words that satisfy the length-6, length-7, and length-8 subword constraints, respectively, can be selected from among the 144 words. The code represented by (12,144,4) is shown in Table 7, and the codewords marked with a dagger among the 144 codewords are the 56 codewords satisfying the length-7 subword constraint.
There are 74 GC templates of length 12 and minimum distance 4; 31 templates among them, in which a template, its reverse sequence, and its 01 inversion are regarded as the same, are shown in Table 8. Since 128 codewords cannot be derived from a single template under the subword constraint, pairs of templates are selected. The two templates of each pair induce mismatches in at least four positions in any ligation, and they do not share a subsequence of length 7 or longer. Eight such pairs of templates are shown in Table 9. DNA codewords prepared from these template pairs show an even GC base distribution when they are ligated. Under this condition, DNA codes derived from these templates share close melting temperatures (New Generation Computing 20, 3, 263-277, 2002).
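The selection of codewords under a subword constraint can be viewed as a maximum clique search, as sketched below. This is only an illustration with a small branch-and-bound search written for this purpose, not the solver cited above. The function names are hypothetical, and the compatibility test (two words are compatible when they share no common run of m consecutive symbols in any alignment) is an assumed reading of the subword constraint.

```python
# Illustrative sketch only; names and the exact compatibility rule are assumptions.
def shares_run(a: str, b: str, m: int) -> bool:
    """True if a and b contain a common substring of length m (any alignment)."""
    runs = {a[i:i + m] for i in range(len(a) - m + 1)}
    return any(b[j:j + m] in runs for j in range(len(b) - m + 1))

def max_compatible_subset(words, m):
    """Largest subset of words that are pairwise free of shared length-m runs,
    found as a maximum clique of the compatibility graph."""
    words = list(dict.fromkeys(words))  # drop duplicates, keep order
    adj = {w: {v for v in words if v != w and not shares_run(w, v, m)}
           for w in words}
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = list(clique)
        for v in sorted(candidates):
            if len(clique) + len(candidates) <= len(best):
                return  # cannot beat the best clique found so far
            candidates = candidates - {v}
            expand(clique + [v], candidates & adj[v])

    expand([], set(words))
    return best

toy_words = ["000111", "111000", "010101", "101010"]  # assumed toy input
print(max_compatible_subset(toy_words, 3))
```

Exact clique search grows quickly with the number of words, so for a set such as the 144 words of the (12,144,4) code a dedicated solver, as used in the text, would be the more practical choice.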
By combining one of eight template pairs shown in Table 9 with the 56 codewords satisfying the length 7-subword constraint shown in Table 7, 112 codewords (10 of 112 codewords are shown in Tables 5 and 6) were obtained that satisfy the following conditions.
The number of codewords thus designed, 112, falls short of the 128 ASCII characters. However, some characters are usually unused in ASCII characters. For example, the values of HTML characters from  to  are not used. Therefore, the 112 codewords suffice for representing DNA ASCII code. This compromise is preferable to loosening of the constraints to obtain 128 codes.
The current status of information-encoding models using DNA was reviewed, and the necessity of and problems in constructing DNA codes were described. The method for designing a DNA code of the present invention can provide 112 DNA codewords of length 12 and comma-free index 4. The DNA code of the present invention takes into account arbitrary concatenation between codewords including their complementary strands, and such a DNA code has not been known until today.
Number | Date | Country | Kind |
---|---|---|---|
2003-151738 | May 2003 | JP | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP04/07271 | 5/27/2004 | WO | | 8/16/2006 |