1. Field of the Invention
The present application is generally related to the synthesis of DNA molecules, and more particularly, to the synthesis of a synthetic gene or other DNA sequence.
2. Description of the Related Art
Proteins are an important class of biological molecules that have a wide range of valuable medical, pharmaceutical, industrial, and biological applications. A gene encodes the information necessary to produce a protein according to the genetic code by using three nucleotides (one codon or set of codons) for each amino acid in the protein. An expression vector contains DNA sequences that allow transcription of the gene into mRNA for translation into a protein.
It is often desirable to obtain a synthetic DNA which encodes the protein of interest. DNA can be synthesized accurately in short pieces, say 50 to 80 nucleotides or less. Pieces substantially longer than this become problematic due to cumulative error probability in the synthesis process. Most genes are appreciably longer than 50 to 80 nucleotides, usually by hundreds or thousands of nucleotides. Consequently, direct synthesis is not a convenient method for producing large genes. Currently, large synthetic genes with a desired DNA sequence are manufactured by any one of several methods:
1. If the gene does not contain introns it can be synthesized by PCR directly from genomic DNA. This is feasible for genes of bacteria, lower eukaryotes, and many viruses, but nearly all genes of higher organisms contain introns.
2. A related alternative is to PCR the gene from a full-length cDNA clone. It is time consuming and tedious to isolate and characterize a full-length clone, and full-length cDNA clones are available for only a very small fraction of the genes of any higher organism.
3. Based on the gene sequence inferred from the genomic DNA sequence, short DNA segments of both strands of a gene can be synthesized with overlapping ends. These segments are allowed to anneal and are joined together with DNA ligase. Annealing efficiency and accuracy at the segment junctions is often poor, resulting in low yields.
4. An approach to reduce this problem is to build the gene up in subsections in a step-wise manner. This remains time-consuming, expensive, tedious, and inefficient, because many reactions must be performed.
5. Based on the gene sequence inferred from the genomic DNA sequence, short overlapping duplex DNA segments of the gene can be synthesized that contain compatible end-proximal restriction endonuclease sites. Each fragment can be cut with the appropriate enzyme, annealed, and joined with DNA ligase. In addition to the limitations above, both strands of the gene sequence must be synthesized and this method is dependent on the placement of appropriate restriction sites evenly spaced throughout the gene sequence.
6. Genes have also been assembled by overlap extension of partially overlapping oligonucleotides using DNA-polymerase-catalyzed reactions. The gene is divided into oligonucleotides, each of which partially overlaps and is complementary to the adjacent oligonucleotide(s). The oligonucleotides are allowed to anneal and the resulting DNA construct is extended to the full-length double-stranded gene. See, for example, W. P. C. Stemmer et al. “Single-step assembly of a gene and entire plasmid from large numbers of oligodeoxyribonucleotides” Gene, 1995, 164, 49-53 and D. E. Casimiro et al. “PCR-based gene synthesis and protein NMR spectroscopy” Structure, 1997, 5, 1407-1412, the disclosures of which are incorporated by reference. Designing the oligonucleotides for gene synthesis by this approach has recently been automated as described in D. M. Hoover & J. Lubkowski “DNA Works: an automated method for designing oligonucleotides for PCR-based gene synthesis” Nucleic Acids Res., 2002, 30:10 e43, the disclosure of which is incorporated by reference. The method optimizes codon usage, optionally removes DNA hairpins, and uses a nearest-neighbor model of DNA melting to achieve homogeneous target melting temperatures. The process of removing local DNA hairpins and dimerization from a single DNA oligonucleotide is also referred to by those skilled in the art as “removing DNA secondary structure.” The methods described in these references do not globally optimize a melting temperature gap between correct hybridizations and incorrect hybridizations, however.
The present application provides a method for synthesizing a DNA sequence and the DNA sequences synthesized by the method. A preferred embodiment of the method utilizes the flexibility of the genetic code to achieve melting temperatures that optimize the simultaneous annealing of many gene segments in the desired order, facilitating the assembly of a large strand from many small ones.
In some embodiments, the likelihood of synthesizing the correct DNA sequence from a mixture of correct and incorrect gene segments is increased by determining a property of the DNA sequence or fragment thereof, or of a polypeptide or protein expressed therefrom, or of another molecule derived therefrom in order to ascertain the correctness of the DNA sequence or fragment thereof. Examples of suitable properties include the DNA sequence and the molecular weight of the polypeptide expressed therefrom.
The method in the present application is of practical utility. A number of companies currently offer synthetic gene services. Consequently, a method with improved efficiency is valuable. A partial list of such companies may be generated easily, for example, by performing an Internet search for “custom gene synthesis,” “synthetic gene services,” or related keywords.
Those skilled in the art will immediately comprehend myriad applications to which the disclosed method may be applied, including: (1) creating de novo “designer” proteins; (2) coupling to automated expression and crystallization facilities; (3) building DNA sequences predicted to express novel protein folds for structural proteomics; (4) building other DNA sequences that do not encode proteins, e.g., as RNA structural templates or DNA nanotechnology components; (5) expressing proteins from a different species in a desired expression vector according to its own codon usage preference; and (6) creating a small synthetic genome by specifying its desired protein sequences and regulatory protein binding sites.
The present application provides a recursive method for synthesizing a gene of arbitrary size, i.e., a double-stranded DNA (dsDNA) sequence that codes for a desired peptide sequence, possibly with flanking regulatory and intergenic sequences extending into other flanking genes. The disclosed method uses sequence degeneracy to achieve melting temperatures that optimize the simultaneous annealing of many gene segments in the desired order. This optimization is achieved by choosing bases and codons such that, with high probability, incorrect or wrong hybridizations melt at a low temperature and correct or right hybridizations melt at a high temperature, allowing the construction of a large synthetic gene quickly and easily.
The method comprises: (a) hierarchical assembly (b) by high-fidelity techniques such as overlap extension using proof-reading DNA polymerase, ligation, cloning, or other methods, (c) with optimization of the sequences of the component oligonucleotide pieces to facilitate preferential hybridization to the desired adjacent piece(s) and to disfavor undesired hybridizations between other pieces, for example, by exploiting the degeneracy of the genetic code or a regulatory region consensus sequence, (d) to achieve a DNA melting temperature gap between correct (high melting temperature) and incorrect (low melting temperature) hybridizations, and (e) optionally selecting pieces of DNA likely to have the correct DNA sequence for subsequent assembly steps, (f) so that, with high probability, correct assemblies will form. Thus, the sequence is designed to encode its own correct self-assembly in a signal superimposed on the coding sequence by using synonymous codon substitutions in a manner friendly to the expression vector.
R. M. Horton et al. “Engineering hybrid genes without the use of restriction enzymes: gene splicing by overlap extension” Gene, 1989, 77, 61-68, the disclosure of which is incorporated by reference in its entirety, discloses overlap extension, but does not disclose optimizing overlap regions to facilitate the assembly of pieces in the desired order. In particular, Horton et al. does not disclose the use of informatics to encode the self-assembly of the gene by exploiting sequence degeneracy to achieve high melting temperatures for correct hybridizations and low melting temperatures for incorrect hybridizations.
Accordingly, the present disclosure provides a method for synthesizing a DNA sequence comprising at least the steps of: (i) dividing the DNA sequence recursively into small pieces of DNA, wherein adjacent pieces comprise overlapping regions; (ii) optimizing the sequences of the pieces of DNA resulting from each recursive division to strengthen correct hybridizations and to disrupt incorrect hybridizations; (iii) obtaining the optimized small pieces of DNA, wherein the overlapping regions of any adjacent pieces of single-stranded DNA are complementary; (iv) combining the pieces of DNA derived from the division of the next larger piece of DNA; (v) allowing the pieces of DNA to self-assemble to form a DNA construct comprising single-stranded DNA segments connected by double-stranded overlap regions; (vi) producing the next-larger piece of DNA from the DNA construct; and (vii) repeating steps (iv), (v), and (vi) in reverse order of the recursive division in step (i) to produce the DNA sequence.
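The recursive division of step (i) and the reverse-order reassembly of step (vii) can be illustrated in silico. The following is a minimal sketch only, not the disclosed implementation: it assumes a simple binary split at each level (the disclosure permits division into many pieces), it operates on one strand as a text string, and the names `divide`, `assemble`, `merge`, and `leaves` are hypothetical.

```python
def divide(seq, max_len, overlap):
    """Recursively split seq into two overlapping halves until each piece
    is at most max_len bases long. Returns a nested tuple (the hierarchy)."""
    if overlap < 1 or max_len < 2 * overlap + 2:
        raise ValueError("need overlap >= 1 and max_len >= 2 * overlap + 2")
    if len(seq) <= max_len:
        return seq  # small enough to be obtained directly
    mid = len(seq) // 2
    half = overlap // 2
    left = seq[: mid + half]               # ends 'half' bases past the midpoint
    right = seq[mid - (overlap - half):]   # starts so that the shared region is 'overlap' long
    return (divide(left, max_len, overlap), divide(right, max_len, overlap))


def merge(left, right, overlap):
    """Join two adjacent pieces through their shared overlap region."""
    if left[-overlap:] != right[:overlap]:
        raise ValueError("overlap regions do not match")
    return left + right[overlap:]


def assemble(tree, overlap):
    """Reassemble in the reverse order of the division: leaves first."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    return merge(assemble(left, overlap), assemble(right, overlap), overlap)


def leaves(tree):
    """List the smallest pieces, left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])
```

In this sketch each level of the nested tuple corresponds to one "next-larger piece"; a wet-lab reassembly would replace `merge` with overlap extension, ligation, or cloning.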
The present disclosure further provides a DNA sequence, synthesized according to a method comprising at least the steps of: (i) dividing the DNA sequence recursively into small pieces of DNA, wherein adjacent pieces comprise overlapping regions; (ii) optimizing the sequences of the pieces of DNA resulting from each recursive division to strengthen correct hybridizations and to disrupt incorrect hybridizations; (iii) obtaining the optimized small pieces of DNA, wherein the overlapping regions of any adjacent pieces of single-stranded DNA are complementary; (iv) combining the pieces of DNA derived from the division of the next larger piece of DNA; (v) allowing the pieces of DNA to self-assemble to form a DNA construct comprising single-stranded DNA segments connected by double-stranded overlap regions; (vi) producing the next-larger piece of DNA from the DNA construct; and (vii) repeating steps (iv), (v), and (vi) in reverse order of the recursive division in step (i) to produce the DNA sequence.
In some embodiments, a next-larger piece of DNA produced in step (vi) comprises a mixture of DNA molecules. Some embodiments further comprise a step of selecting a DNA molecule from the mixture likely to have the correct DNA sequence and using the selected DNA molecule in the synthesis of the DNA sequence. In some embodiments, a DNA molecule is separated from the mixture by cloning. In some embodiments, the selection comprises determining a property of selected DNA molecules from the mixture, or of polypeptides expressed therefrom, and selecting a DNA molecule based on a predetermined value for the property. In some embodiments, the selection comprises sequencing a sample of DNA molecules from the mixture and selecting a DNA molecule with the correct DNA sequence. In some embodiments, the selection comprises expressing a polypeptide from each member of a sample of DNA molecules from the mixture, determining the molecular weight of the polypeptide, and selecting a DNA molecule from which a polypeptide with a predetermined molecular weight is expressed. In some embodiments, a start codon and/or stop codon is incorporated into the DNA molecule from which a polypeptide is expressed. In some embodiments, the reading frame of the DNA molecule is adjusted with respect to the start codon and/or stop codon. In some embodiments, one or more stop codons are inserted into the expression vector downstream (3′) from the gene. In some embodiments, the molecular weight of the polypeptide is determined by electrophoresis.
The present disclosure further provides a method for synthesizing a DNA sequence comprising at least the steps of: (i) dividing the DNA sequence recursively into small pieces of DNA, wherein adjacent pieces comprise overlapping regions; (ii) obtaining the small pieces of DNA, wherein the overlapping regions of any adjacent pieces of single-stranded DNA are complementary; (iii) combining the pieces of DNA derived from the division of the next larger piece of DNA; (iv) allowing the pieces of DNA to self-assemble to form a DNA construct comprising single-stranded DNA segments connected by double-stranded overlap regions; (v) producing the next-larger piece of DNA from the DNA construct; (vi) selecting a next-larger piece of DNA likely to have the correct sequence; and (vii) repeating steps (iii), (iv), (v), and (vi) in reverse order of the recursive division in step (i) to produce the DNA sequence.
The present disclosure further provides a DNA sequence, synthesized according to a method comprising at least the steps of: (i) dividing the DNA sequence recursively into small pieces of DNA, wherein adjacent pieces comprise overlapping regions; (ii) obtaining the small pieces of DNA, wherein the overlapping regions of any adjacent pieces of single-stranded DNA are complementary; (iii) combining the pieces of DNA derived from the division of the next larger piece of DNA; (iv) allowing the pieces of DNA to self-assemble to form a DNA construct comprising single-stranded DNA segments connected by double-stranded overlap regions; (v) producing the next-larger piece of DNA from the DNA construct; (vi) selecting a next-larger piece of DNA likely to have the correct sequence; and (vii) repeating steps (iii), (iv), (v), and (vi) in reverse order of the recursive division in step (i) to produce the DNA sequence.
In some embodiments, the DNA sequence comprises a regulatory sequence. In other embodiments, the synthetic gene has an intergenic sequence. In other embodiments, the DNA sequence has flanking regulatory and intergenic sequences extending into other flanking genes. In other embodiments, the DNA sequence encodes a polypeptide. Preferably, the polypeptide is a portion of a full-length protein. More preferably, the polypeptide is a full-length protein. In some embodiments, the DNA sequence is a synthetic genome comprising multiple flanking encoded polypeptides, their regulatory regions, and intergenic regions.
In a preferred embodiment, the DNA sequence is divided into small pieces of DNA in a single division. This embodiment is referred to herein as “direct self-assembly.” In another preferred embodiment, the sequence of the synthetic gene is divided into small pieces of DNA in a plurality of divisions. This embodiment is referred to herein as “recursive assembly” or “hierarchical assembly.” In one embodiment, the sequence of the synthetic gene is divided into pieces of DNA of about 1,500 bases long or shorter.
The small pieces of DNA are preferably about 60 bases long or shorter, more preferably, about 50 bases long or shorter. Preferably, the overlapping regions comprise from about 6 to about 60 base-pairs, more preferably, from about 14 to about 33 base-pairs.
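By way of illustration, a single division into short pieces with overlapping ends might be sketched as follows. This is an assumption-laden example rather than the disclosed procedure: it uses fixed piece and overlap lengths (50 and 20 bases, chosen from within the ranges stated above), places alternate pieces on opposite strands, and does not optimize boundary points. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def split_into_oligos(seq, piece_len=50, overlap=20):
    """Tile seq with pieces of at most piece_len bases whose neighbours share
    'overlap' bases. Even-numbered pieces are top strand; odd-numbered pieces
    are returned as bottom strand (reverse complement), each written 5' to 3'."""
    if not 0 < overlap < piece_len:
        raise ValueError("require 0 < overlap < piece_len")
    step = piece_len - overlap
    oligos = []
    start = 0
    index = 0
    while True:
        top = seq[start:start + piece_len]
        oligos.append(top if index % 2 == 0 else reverse_complement(top))
        if start + piece_len >= len(seq):
            break  # this piece reaches the end of the sequence
        start += step
        index += 1
    return oligos
```

With a 50-base piece and a 20-base overlap, each piece pairs with its two neighbours and leaves a 10-base single-stranded stretch in between, which a polymerase or ligation-based step would then fill or join.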
In a preferred embodiment, optimization is performed by calculating a melting temperature for the pieces of DNA. Preferably, the lowest correct hybridization melting temperature is higher than the highest incorrect hybridization melting temperature. Those skilled in the art will realize that the size of the melting temperature gap is related to the annealing conditions such that a narrower gap may require more stringent annealing conditions in the reassembly step to provide the requisite level of fidelity. Consequently, the temperature gap has no minimum value. Practically, the difference between the lowest-melting correct match and the highest melting incorrect match is at least about 1° C., more preferably, at least about 4° C., more preferably, at least about 8° C., most preferably, at least about 16° C. The wider the temperature gap, the more robust the self-assembly, thereby permitting the use of less stringent annealing conditions. Those skilled in the art will appreciate that optimization may be performed using other parameters or measures related to hybridization propensity, for example, free energy, enthalpy, entropy, or other arithmetic or algebraic combinations of such parameters or measures, to achieve the same effect as melting temperature. Indeed, the melting temperature itself is one such arithmetic or algebraic combination of such parameters or measures. Consequently, in some embodiments, optimization is performed by calculating a parameter related to hybridization propensity for the pieces of DNA, for example, free energy, enthalpy, entropy, and arithmetic or algebraic combinations thereof.
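The melting temperature gap described above can be estimated numerically. The sketch below is a crude illustration under stated assumptions: it substitutes the simple Wallace rule (2 °C per A/T pair, 4 °C per G/C pair) for a nearest-neighbor model, and it approximates an incorrect hybridization by the longest perfectly matching stretch shared between two different overlap regions on either strand. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def wallace_tm(duplex):
    """Rough melting temperature in deg C of a short duplex (Wallace rule)."""
    at = duplex.count("A") + duplex.count("T")
    gc = duplex.count("G") + duplex.count("C")
    return 2.0 * at + 4.0 * gc


def longest_common_substring(a, b):
    """Longest stretch of bases that appears contiguously in both a and b."""
    best = ""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > len(best):
                    best = a[i - cur[j]:i]
        prev = cur
    return best


def melting_gap(overlaps):
    """Lowest correct Tm minus highest incorrect Tm for a list of
    top-strand overlap sequences (the intended duplex regions)."""
    correct = [wallace_tm(o) for o in overlaps]
    incorrect = [0.0]
    for i in range(len(overlaps)):
        for j in range(i + 1, len(overlaps)):
            # A shared stretch lets one overlap pair with the other's
            # opposite strand; check both orientations.
            for other in (overlaps[j], reverse_complement(overlaps[j])):
                stretch = longest_common_substring(overlaps[i], other)
                incorrect.append(wallace_tm(stretch))
    return min(correct) - max(incorrect)
```

An optimizer along the lines described would vary the sequences (for example, by silent codon substitution) to make the value returned by `melting_gap` as large as possible; a free-energy measure could be swapped in for `wallace_tm` without changing the structure.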
In some embodiments, the pieces of DNA are optimized by permuting silent codon substitutions, for example for a portion encoding a polypeptide. In some embodiments, the pieces of DNA are optimized by taking advantage of the degeneracy in the regulatory region consensus sequence, for example for a regulatory region. In some embodiments, the pieces of DNA are optimized by adjusting boundary points between adjacent pieces of DNA. In some embodiments, the pieces of DNA are optimized by direct base assignment, for example for an intergenic region.
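Silent codon substitution can be illustrated by enumerating every DNA sequence that encodes a given peptide under the standard genetic code. This is a minimal sketch; the names `CODON_TABLE`, `SYNONYMS`, `translate`, and `synonymous_variants` are hypothetical, and exhaustive enumeration is shown only for clarity, since the number of variants grows exponentially with peptide length and a practical optimizer would search this space rather than list it.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Map each amino acid (or '*' for stop) to its synonymous codons.
SYNONYMS = {}
for codon, amino_acid in CODON_TABLE.items():
    SYNONYMS.setdefault(amino_acid, []).append(codon)


def translate(dna):
    """Translate complete codons of dna, reading from the first base."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def synonymous_variants(peptide):
    """Yield every DNA sequence encoding the peptide (one-letter codes)."""
    for codons in product(*(SYNONYMS[aa] for aa in peptide)):
        yield "".join(codons)
```

Every sequence yielded encodes the same polypeptide, so an optimizer is free to choose among them on hybridization grounds alone, subject to any codon usage preference of the expression host.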
In a preferred embodiment, at least one of the optimized small pieces of DNA is synthetic. In another preferred embodiment, at least one of the optimized small pieces of DNA is single-stranded.
In some embodiments, a single-stranded DNA segment in the DNA construct has a length of zero bases. In some embodiments, a single stranded DNA segment has a length of from about zero bases to about 20 bases.
In a preferred embodiment, the next-larger piece of DNA is produced by cloning the DNA construct and using cellular machinery. Examples of suitable cloning methods include exonuclease III cloning, topoisomerase cloning, restriction enzyme cloning, and homologous recombination cloning. In another preferred embodiment, the next-larger piece of DNA is produced by ligating the DNA construct. In yet another preferred embodiment, the next-larger piece of DNA is produced by extending the DNA construct by a reaction using a DNA polymerase. Preferably, the DNA polymerase is a proof-reading DNA polymerase.
In a preferred embodiment, a 3′ nucleotide in an overlapping region is G or C. Preferably, both 3′ nucleotides in an overlapping region are independently G or C. In another preferred embodiment, a 3′ nucleotide in an overlapping region is A or T.
In a preferred embodiment, a DNA polymerase primer is mixed with the pieces of DNA derived from the division of the next-larger piece of DNA. In another preferred embodiment, no DNA polymerase primer is combined with the pieces of DNA derived from the division of the next-larger piece of DNA.
In a preferred embodiment, a restriction site is designed into an overlapping region. In another preferred embodiment, the restriction site is digested with a site-specific restriction enzyme.
Some embodiments provide a method of synthesizing a DNA sequence and/or the DNA sequence synthesized therewith. The method comprises at least the steps of: (i) dividing the DNA sequence recursively into small pieces of DNA, wherein adjacent pieces comprise overlapping regions; (ii) optimizing the sequences of the pieces of DNA resulting from each recursive division to strengthen correct hybridizations and to disrupt incorrect hybridizations; (iii) obtaining the optimized small pieces of DNA, wherein the overlapping regions of any adjacent pieces of single-stranded DNA are complementary; (iv) combining the pieces of DNA derived from the division of the next-larger piece of DNA; (v) allowing the pieces of DNA to self-assemble to form a DNA construct comprising single-stranded DNA segments connected by double-stranded overlap regions; (vi) producing the next-larger piece of DNA from the DNA construct; and (vii) repeating steps (iv), (v), and (vi) in reverse order of the recursive division in step (i) to produce the DNA sequence. At least one next-larger piece of DNA comprises a mixture of DNA molecules, the mixture comprising a correct DNA sequence and a DNA sequence comprising a point deletion. The method further comprises isolating a next-larger piece of DNA from the mixture by a method comprising at least the steps of: inserting the next-larger piece of DNA into a DNA insertion site in a frameshifted vector; transforming a preselected organism with the resulting vector; selecting an organism exhibiting a predetermined phenotype; and isolating the next-larger piece of DNA from the selected organism. 
The frameshifted vector comprises an open reading frame comprising a gene and the DNA insertion site, the gene comprises a functional portion that encodes a functional polypeptide, the expression of which changes the phenotype of the organism, the functional portion of the gene is frameshifted such that no functional polypeptide is expressed, and the DNA insertion site is upstream of the functional portion of the gene. The next-larger piece of DNA, when inserted at the DNA insertion site, corrects the frameshift, such that the functional portion of the gene expresses a functional polypeptide.
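The reading-frame logic behind this selection can be checked with simple modular arithmetic. The sketch below rests on assumptions not stated in the disclosure: the vector's frameshift is modeled as a net number of bases inserted (+1) or deleted (−1) upstream of the functional portion, and the piece of DNA is modeled as adding its full length at the insertion site. The function name is hypothetical.

```python
def restores_reading_frame(insert_len, vector_frameshift):
    """True when an insert of insert_len bases brings the downstream
    functional portion back into frame.

    vector_frameshift is the net shift already present in the vector,
    e.g. +1 or -1 (an assumed sign convention)."""
    return (insert_len + vector_frameshift) % 3 == 0
```

Under this model a correct 301-base piece restores the frame of a −1 vector, whereas the same piece carrying a point deletion (300 bases) does not, so only the former yields the predetermined phenotype. The check is necessary but not sufficient: an insert must also be free of in-frame stop codons, and compensating errors that preserve length modulo three would pass.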
In some embodiments, the frameshifted vector is a plasmid and the preselected organism is E. coli. In some embodiments, the change in phenotype is visually apparent.
In some embodiments, the change in phenotype is color. In some embodiments, the plasmid comprises a gene for the α-complementing fragment of β-galactosidase and the preselected organism is an E. coli strain with the lacZΔM15 genotype. In some embodiments, the transformed E. coli is grown on indicator agar comprising isopropylthio-β-D-galactoside (IPTG) and 5-bromo-4-chloro-3-indolyl-β-D-galactoside (X-Gal), and the predetermined phenotype is a blue-colored colony. In some embodiments, the plasmid has SEQ. ID. NO.: 661 and the preselected organism is E. coli JM109.
In other embodiments, the change in phenotype is growth at a restrictive temperature. In some embodiments, the plasmid comprises a gene for valyl-tRNA synthetase and the preselected organism is E. coli AB4141. In some embodiments, the plasmid has SEQ. ID. NO.: 667.
In some embodiments, the frameshift is a −1 frameshift. In other embodiments, the frameshift is a +1 frameshift. In some embodiments, the DNA insertion site comprises a restriction site.
In some embodiments, the next-larger piece of DNA is an intermediate fragment. In other embodiments, the next-larger piece of DNA is a full-length gene.
In some embodiments, the next-larger piece of DNA is isolated by polymerase chain reaction.
Some embodiments provide a method for isolating a piece of DNA and/or a piece of DNA isolated therewith. The method comprises at least the steps of: inserting the piece of DNA into a DNA insertion site in a frameshifted vector, transforming a preselected organism with the resulting vector, selecting an organism exhibiting a predetermined phenotype, and isolating the piece of DNA from the preselected organism. The frameshifted vector comprises an open reading frame comprising a gene and the DNA insertion site, the gene comprises a functional portion that encodes a functional polypeptide, the expression of which changes the phenotype of the organism, the functional portion of the gene is frameshifted such that no functional polypeptide is expressed, and the DNA insertion site is upstream of the functional portion of the gene. The piece of DNA, when inserted at the DNA insertion site, corrects the frameshift, such that the functional portion of the gene expresses a functional polypeptide.
In some embodiments, the frameshifted vector is a plasmid and the preselected organism is E. coli. In some embodiments, the change in phenotype is visually apparent.
In some embodiments, the change in phenotype is color. In some embodiments, the plasmid comprises a gene for the α-complementing fragment of β-galactosidase and the preselected organism is an E. coli strain with the lacZΔM15 genotype. In some embodiments, the transformed E. coli is grown on indicator agar comprising isopropylthio-β-D-galactoside (IPTG) and 5-bromo-4-chloro-3-indolyl-β-D-galactoside (X-Gal), and the predetermined phenotype is a blue-colored colony. In some embodiments, the plasmid has SEQ. ID. NO.: 661 and the preselected organism is E. coli JM109.
In other embodiments, the change in phenotype is growth at a restrictive temperature. In some embodiments, the plasmid comprises a gene for valyl-tRNA synthetase and the preselected organism is E. coli AB4141. In some embodiments, the plasmid has SEQ. ID. NO.: 667.
In some embodiments, the frameshift is a −1 frameshift. In other embodiments, the frameshift is a +1 frameshift. In some embodiments, the DNA insertion site comprises a restriction site.
In some embodiments, the piece of DNA is an intermediate fragment. In other embodiments, the piece of DNA is a full-length gene.
In some embodiments, the piece of DNA is isolated by polymerase chain reaction.
Some embodiments provide a frameshifted vector with SEQ. ID. NO.: 661. Other embodiments provide a frameshifted vector with SEQ. ID. NO.: 667.
Some embodiments provide a method of synthesizing a frameshifted vector and/or a frameshifted vector synthesized therewith. The method comprises at least the steps of: selecting a vector comprising an open reading frame, wherein the open reading frame comprises a functional portion that encodes a functional polypeptide, the expression of which changes the phenotype of a preselected organism transformed with the vector; and inserting and/or deleting one or more bases in the open reading frame upstream of the functional portion of the gene, thereby changing the reading frame of the functional portion of the gene, thereby producing a frameshifted vector that, when transformed into the preselected organism does not express a functional polypeptide.
In some embodiments, the inserting and/or deleting comprises cutting the vector with a restriction enzyme, digesting the overhanging single-stranded portions, and religating the vector. In some embodiments, the inserting and/or deleting comprises site-directed mutagenesis.
Some embodiments provide a kit for isolating a piece of DNA with a predetermined sequence. The kit comprises at least a frameshifted vector and instructions for isolating a piece of DNA with a predetermined sequence using the frameshifted vector. The frameshifted vector comprises an open reading frame comprising a gene comprising a functional portion that expresses a functional polypeptide, the expression of which changes the phenotype of a preselected organism transformed with the vector; and a DNA insertion site upstream of the functional portion of the gene, wherein the open reading frame comprises a frameshift upstream of the functional portion of the gene, such that the vector does not express a functional polypeptide. Some embodiments of the kit comprise a frameshifted vector with a −1 frameshift and a frameshifted vector with a +1 frameshift. Some embodiments of the kit further comprise a component selected from the group consisting of the preselected organism, a restriction enzyme, a culture medium, and combinations thereof.
As used herein, the term “DNA” includes both single-stranded and double-stranded DNA. The term “piece” may refer to either a real or hypothetical piece of DNA depending on context. A “very large” piece of DNA is longer than about 1,500 bases, a “large” piece of DNA is about 1,500 bases or fewer, a “medium-sized” piece of DNA is about 300 to 350 bases or fewer, and a “short” piece of DNA is about 50 to 60 bases or fewer. Short pieces of DNA are also referred to as “oligonucleotides” by those skilled in the art. It will be appreciated that these numbers are approximate, however, and may vary with different processes or process variations. Although descriptions of preferred embodiments of the disclosed method that follow describe each recursive or hierarchical step as involving pieces of DNA of the same size range (for example, in which all of the pieces of DNA are very large, large, medium-sized, or short), one skilled in the art will appreciate that a hierarchical step may involve DNA from more than one size range. A particular step may involve both short and medium-sized pieces of DNA, or even short, medium-sized, and large pieces of DNA.
A “small” or “short” piece of DNA is a DNA segment that can be synthesized, purchased, or is otherwise readily obtained. The term “segment” is also used herein to mean “small piece.” Those skilled in the art will understand that the term “synthon” is synonymous with the terms “small piece” and “segment” as used herein, although the term “synthon” is not used herein. The DNA segments used in the Examples that follow are synthetic; however, the disclosed method also comprehends using DNA segments derived from other sources known in the art, for example, from natural sources including viruses, bacteria, fungi, plants, or animals; from transformed cells; from tissue cultures; by cloning; or by PCR amplification of a naturally occurring or engineered sequence. As used herein, a “correct” piece of DNA is a piece of DNA with the correct or desired nucleotide sequence. An “incorrect” piece is one with an incorrect or undesired nucleotide sequence.
The disclosed method proceeds by a divide-and-conquer strategy (See Aho et al. The Design and Analysis of Computer Algorithms Addison-Wesley; Reading, Mass.: 1974, the disclosure of which is incorporated by reference in its entirety). A problem that is too large to be solved directly is broken recursively into smaller sub-problems until each is small enough to be solved directly, then the small sub-solutions are combined recursively into a solution to the original problem. Here, a particular full-length gene is too long to synthesize directly. It is broken recursively into smaller overlapping pieces until each is small enough to be synthesized. The gene is then reassembled in the reverse order of the disassembly. The smallest pieces are reassembled into the next larger pieces, which are reassembled into the next larger pieces, and so on and so forth. A reassembly step may be performed by overlap extension using a high-fidelity DNA polymerase as illustrated in
As discussed in greater detail below, a synthetic oligonucleotide is typically a mixture containing the desired oligonucleotide mixed with incorrect oligonucleotides, that is, oligonucleotides that do not have the desired sequence. As would be apparent to one skilled in the art, synthesizing a gene from such a mixture will likely produce the correct or desired gene in admixture with incorrect genes. One method for synthesizing only the correct gene is to reassemble the gene from pieces of DNA that are likely to have the correct sequences. Consequently, in some embodiments, during the reassembly process, pieces of DNA are selected that are likely to have the correct sequences for use in subsequent reassembly steps. In some embodiments, the criterion for the selection is a property of a reassembled piece of DNA or a polypeptide encoded by and expressed therefrom. In some embodiments, the criterion for the selection is a property determined from the full-length piece of DNA or polypeptide expressed therefrom. In some embodiments, the criterion for the selection is a property determined for the complementary strand of DNA or polypeptide expressed therefrom. In some embodiments, the criterion for the selection is a property determined for a piece of RNA transcribed from the piece of DNA. In some embodiments, the criterion for the selection is the phenotype of an organism into which the DNA is inserted.
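The reason a synthetic product is a mixture, and why longer pieces contain a smaller fraction of correct molecules, can be shown with a short calculation. This is a simplified model under assumptions not taken from the disclosure: errors are treated as independent at each base, and the per-base fidelity used in any example (such as 0.995) is a hypothetical figure chosen only for illustration.

```python
def fraction_error_free(length, per_base_fidelity):
    """Expected fraction of molecules with no error anywhere, assuming each
    of 'length' bases is independently correct with the given probability."""
    if not 0.0 <= per_base_fidelity <= 1.0:
        raise ValueError("per_base_fidelity must be between 0 and 1")
    return per_base_fidelity ** length


if __name__ == "__main__":
    # Compare a short oligonucleotide, a medium-sized piece, and a large piece.
    for n in (50, 300, 1500):
        print(n, fraction_error_free(n, 0.995))
```

Because the correct fraction falls geometrically with length, selecting likely-correct pieces at intermediate stages keeps errors from accumulating into the full-length product.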
Any property indicative of the correctness of the DNA sequence is useful in the method. The likelihood or probability of selecting a correct sequence depends on the property that is measured or determined. In some embodiments, the likelihood approaches certainty. An example of such a property is an experimentally determined nucleotide sequence of the piece of DNA. Any method known in the art for determining a nucleotide sequence may be used, including automated sequencing, manual sequencing, mass spectroscopic sequencing, and the like. In other embodiments, the property indicates that the piece of DNA probably has the correct sequence, but does not confirm the correctness of the sequence. An example of such a property is the molecular weight of a polypeptide encoded by and expressed from the piece of DNA. Any method known in the art for determining the molecular weight of a polypeptide or protein may be used, including gel electrophoresis, mass spectroscopy, and the like.
The term “PCR” as used herein in the context of assembling or reassembling DNA is a PCR or overlap extension reaction, preferably using a proof-reading DNA polymerase (“proof-reading PCR”). The term “direct self-assembly” as used herein in the context of assembling or reassembling DNA is a copy-free method of producing a DNA construct or a DNA construct produced by the method, comprising assembling a large piece of DNA from short synthetic segments in a single step. “Copy-free” means that the method lacks a copy step, such as is found in overlap extension or PCR, thus eliminating the copying errors. In a preferred embodiment, adjacent segments on the same strand abut, i.e., form a nick in the strand. Preferably, the nicks in the self-assembly are repaired by in vitro ligation. In another preferred embodiment, the nicks are repaired in vivo by cellular machinery after cloning.
The method comprises (a) hierarchical assembly, (b) by high-fidelity techniques such as overlap extension using proof-reading DNA polymerase, ligation, cloning, or other methods, (c) with optimization of the sequences in the component oligonucleotides to facilitate preferential hybridization to the desired adjacent piece(s) and to disfavor undesired hybridizations to other pieces, for example, by exploiting the degeneracy of the genetic code or a regulatory region consensus sequence, (d) to achieve a DNA melting temperature gap between correct (high melting temperature) and incorrect (low melting temperature) hybridizations, (e) optionally selecting pieces of DNA likely to have the correct DNA sequence for use in the subsequent assembly steps, (f) so that, with high probability, correct assemblies will form.
Hierarchical assembly ensures that the complexity at any step, and therefore the possible number of incorrect assemblies, is bounded and manageably small. Overlap extension allows every correctly hybridized 3′ end to be extended reliably to a complementary copy of the prefix of its match. Using a high-fidelity DNA polymerase reaction ensures that the copy, with high probability, is correct. Ligation and cloning achieve the same result as overlap extension while avoiding the small but non-zero error rate associated with DNA polymerase reactions. The genetic code degeneracy permits flexibility in silent codon substitutions to strengthen correct matches and disrupt incorrect ones. A broad temperature gap between the highest-melting incorrect hybridization and the lowest-melting correct hybridization means that, with high probability, most incorrect ones have melted and most correct ones have annealed. Selecting gene fragments likely to be correct means that oligonucleotide sequence errors arising from chemical synthesis will be removed or reduced. Consequently, each reassembly reaction, with high probability, produces a correct larger piece of DNA. Errors will occur, of course, but with low probability. Consequently, two or more compensating errors that yield a product of the same molecular weight—i.e., the same band in the final gel—or two or more compensating deletions that provide a product of the same reading frame—i.e., the same or nearly the same encoded amino acid sequence—would correspond to a doubly rare or rarer event.
In optimizing the base sequences, theoretical melting temperatures are calculated for all possible correct and incorrect hybridizations by methods known in the art, for example, using Mfold. Such methods are disclosed, for example, in M. Zuker et al. “Algorithms and thermodynamics for RNA secondary structure prediction: A practical guide.” in RNA Biochemistry and Biotechnology, Barciszewski & Clark, eds.; Kluwer: 1999; D. H. Mathews et al. “Expanded sequence dependence of thermodynamic parameters provides robust prediction of RNA secondary structure” J. Mol. Biol., 1999, 288, 910-940; J. SantaLucia “A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics” Proc. Natl. Acad. Sci. USA, 1998, 95, 1460-1465; the disclosures of all of which are incorporated by reference in their entireties. The figure of merit is the gap between the lowest-melting correct match and the highest-melting incorrect match. Examples of incorrect matches include: (a) hairpins, in which a short segment folds back and hybridizes to itself; (b) dimers, in which a short segment is partially self-complementary; (c) intersegment mismatches, in which part of one short segment is partially complementary to part of a second; and (d) shifted correct matches, in which a misaligned overlap region is partially complementary to another region within the same overlap. Accordingly, in some embodiments, optimization comprises calculating a melting temperature for a single piece of DNA, for example, for a hairpin. In some embodiments, optimization comprises calculating a melting temperature for two pieces of DNA, for example, for an intersegment mismatch. In some embodiments, optimization comprises calculating both types of melting temperatures.
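By way of illustration only, the figure of merit may be computed as sketched below. The sketch assumes that the melting temperatures have already been obtained from some external calculation; the function name, the dictionary keys, and every temperature value are hypothetical and are not taken from the disclosure.

```python
def melting_gap(correct_tms, incorrect_tms):
    """Lowest-melting correct match minus highest-melting incorrect match."""
    return min(correct_tms) - max(incorrect_tms)

# Hypothetical melting temperatures in degrees C (illustrative values only).
correct = {"overlap_1_2": 68.0, "overlap_2_3": 64.5, "overlap_3_4": 66.25}
incorrect = {
    "hairpin_seg2": 41.0,     # (a) hairpin
    "dimer_seg3": 38.5,       # (b) dimer
    "mismatch_1_4": 47.5,     # (c) intersegment mismatch
    "shifted_2_3": 44.0,      # (d) shifted correct match
}

gap = melting_gap(correct.values(), incorrect.values())
print(f"temperature gap: {gap} C")
```

A positive gap indicates that some annealing temperature lies between the two classes of hybridization; a negative value indicates that at least one incorrect match is calculated to be more stable than at least one correct match.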
The melting temperature gap is widened by perturbations to the codon assignments, including strengthening correct matches by increasing G-C content in the overlaps and disrupting incorrect matches by choosing non-complementary bases. Codon assignments are varied and the process is repeated until the gap is comfortably wide. This process may be performed manually or may be automated. In a preferred embodiment, the search of possible codon assignments is mapped into an anytime branch and bound algorithm developed for biological applications, which is described in R. H. Lathrop et al. “Multi-Queue Branch-and-Bound Algorithm for Anytime Optimal Search with Biological Applications” in Proc. Intl. Conf. on Genome Informatics, Tokyo, Dec. 17-19, 2001 pp. 73-82; in Genome Informatics 2001 (Genome Informatics Series No. 12), Universal Academy Press, the disclosure of which is incorporated by reference. Those skilled in the art will recognize that other optimization methods could be used, e.g., simulated annealing, genetic algorithms, other branch and bound techniques, hill-climbing, Monte Carlo methods, other search strategies, and the like. Those skilled in the art will further realize that optimizing, i.e., weakening, incorrect matches is functionally equivalent to optimizing, i.e., strengthening, correct matches, and vice versa. Consequently, suitable optimization methods include weakening incorrect matches, strengthening correct matches, and any combination thereof.
Those skilled in the art will realize that the size of the melting temperature gap is related to the annealing conditions such that a narrower gap may require more stringent annealing conditions in the reassembly step to provide the requisite level of fidelity. Consequently, the temperature gap has no minimum value. In some embodiments, the temperature gap is greater than 0° C., at least about 1° C., at least about 2° C., at least about 3° C., at least about 4° C., at least about 5° C., at least about 6° C., at least about 7° C., at least about 8° C., at least about 9° C., at least about 10° C., at least about 12° C., at least about 14° C., at least about 16° C., at least about 18° C., or at least about 20° C. Those skilled in the art will understand that, under appropriate annealing conditions, the temperature gap is arbitrarily close to 0° C. Practically, the difference between the lowest-melting correct match and the highest melting incorrect match is at least about 1° C., more preferably, at least about 4° C., more preferably, at least about 8° C., most preferably, at least about 16° C. The wider the temperature gap, the more robust the self-assembly, thereby permitting the use of less stringent annealing conditions.
Those skilled in the art will realize that the temperature gap may be increased by optimizing the division process. For example, the division may be optimized through a nested search strategy, described below. Other appropriate search strategies will be apparent to those skilled in the art.
In an outer search, a temperature variable is initialized to a high temperature, for example, 80° C., then decremented by, for example, one degree at each outer search step. At each outer search step, an inner search is called to see whether it is possible to divide the gene into short pieces such that no overlap region has a melting temperature lower than the current setting of the temperature variable. When the inner search finally succeeds, its corresponding codon assignments are returned from the outer search.
The inner search proceeds via a depth-first search through the possible division points that meet the design constraints, for example, minimum overlap lengths, maximum segment lengths, possible C-G clamps at segment boundaries, and the like. At each inner search step, the overlap region resulting from the most recent candidate division point is used to generate the set of codon assignments that yield the highest self-melting temperature for that region. If the highest self-melting temperature is less than the current setting of the temperature variable, the inner search fails. Otherwise, it continues to the next step.
While the general DNA melting problem is non-linear because DNA secondary structure can cause non-local elements of the sequence to hybridize, correct matches at the overlap regions are characterized by linear hybridizations among purely local bases. Consequently, for a given subsequence of the gene, linear dynamic programming techniques based on base pairs (nearest neighbors) are used to determine the codon assignment to that subsequence that maximizes its melting temperature.
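By way of illustration only, a linear dynamic program over codon choices may be sketched as follows. The sketch assumes a toy nearest-neighbor score (the number of G or C bases in each stack of two adjacent bases) in place of real thermodynamic parameters, and maximizes the summed score rather than a true melting temperature. The codon table is a partial excerpt of the standard genetic code, and all function names are hypothetical.

```python
import itertools

# Partial standard codon table (only the amino acids used in the toy example).
CODONS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}

def stack_score(dinuc):
    """Toy nearest-neighbor score (assumption): G/C bases in the stack."""
    return sum(base in "GC" for base in dinuc)

def total_score(seq):
    """Sum of stack scores over all adjacent base pairs of the sequence."""
    return sum(stack_score(seq[i:i + 2]) for i in range(len(seq) - 1))

def best_codon_assignment(peptide):
    """Dynamic program over codon choices; returns (score, dna)."""
    # best[codon] = (score, dna) of the best prefix ending in that codon
    best = {c: (total_score(c), c) for c in CODONS[peptide[0]]}
    for aa in peptide[1:]:
        nxt = {}
        for codon in CODONS[aa]:
            # Neighbouring codons interact only through the junction stack.
            score, dna = max(
                ((s + stack_score(prev[-1] + codon[0]), d)
                 for prev, (s, d) in best.items()),
                key=lambda item: item[0],
            )
            nxt[codon] = (score + total_score(codon), dna + codon)
        best = nxt
    return max(best.values(), key=lambda item: item[0])

def brute_force(peptide):
    """Exhaustive cross-check, feasible only for short peptides."""
    options = (CODONS[aa] for aa in peptide)
    return max(total_score("".join(c)) for c in itertools.product(*options))

score, dna = best_codon_assignment("MKLGE")
print(score, dna)
```

Because each stack involves only adjacent bases, the best assignment for a prefix depends only on its last codon, which is why the table `best` needs one entry per candidate codon rather than one per full assignment.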
Thus, the gene is divided into short pieces by finding the division points that maximize the melting temperature of the lowest-melting overlap region, while respecting the design constraints, for example, minimum overlap lengths, maximum segment lengths, possible C-G clamps at segment boundaries, and the like. The codon assignments corresponding to the maximal division points are a better starting codon assignment at the beginning of the design process than are the most common codons.
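By way of illustration only, the nested search may be sketched as follows. The sketch makes several simplifying assumptions that are not part of the disclosure: the sequence is held fixed (the codon reassignment of the inner search is omitted), every overlap has the same fixed length, the Wallace rule of thumb stands in for a real melting temperature calculation, and failed start positions are remembered to avoid repeated work. The sequence, names, and parameter values are hypothetical.

```python
GENE = (
    "ATGGCTAGCA" "AAGGAGAAGA" "ACTTTTCACT" "GGAGTTGTCC"
    "CAATTCTTGT" "TGAATTAGAT" "GGTGATGTTA" "ATGGGCACAA"
    "ATTTTCTGTC" "AGTGGAGAGG" "GTGAAGGTGA" "TGCAACATAC"
)  # arbitrary illustrative sequence

def toy_tm(seq):
    """Wallace rule of thumb, 2(A+T) + 4(G+C); a stand-in for a real model."""
    return 2 * sum(b in "AT" for b in seq) + 4 * sum(b in "GC" for b in seq)

def inner_search(gene, start, t_min, overlap, min_seg, max_seg, dead=None):
    """Depth-first search for segments (start, end) covering gene[start:]
    such that every overlap region melts at t_min or higher."""
    dead = set() if dead is None else dead   # starts already known to fail
    if start in dead:
        return None
    remaining = len(gene) - start
    if remaining <= max_seg:                 # the rest fits in one segment
        return [(start, len(gene))] if remaining >= min_seg else None
    for end in range(start + max_seg, start + min_seg - 1, -1):
        if toy_tm(gene[end - overlap:end]) < t_min:
            continue                         # candidate division point fails
        rest = inner_search(gene, end - overlap, t_min,
                            overlap, min_seg, max_seg, dead)
        if rest is not None:
            return [(start, end)] + rest
    dead.add(start)
    return None

def outer_search(gene, t_start=80, t_stop=0, overlap=12, min_seg=30, max_seg=40):
    """Lower the temperature one degree at a time until the inner search succeeds."""
    for t in range(t_start, t_stop - 1, -1):
        division = inner_search(gene, 0, t, overlap, min_seg, max_seg)
        if division is not None:
            return t, division
    return None

result = outer_search(GENE)
print(result)
```

Because the outer loop descends from a high temperature, the first temperature at which the inner search succeeds is the highest value, on the one-degree grid, that the lowest-melting overlap can reach under the stated constraints.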
Those skilled in the art will appreciate that optimization may be performed using other parameters or measures related to hybridization propensity, for example, free energy, enthalpy, entropy, or other arithmetic or algebraic combinations of such parameters or measures, to achieve the same effect as melting temperature. Melting temperature itself is one such arithmetic or algebraic combination of such parameters or measures.
The disclosed method comprises a design or decomposition process and a synthesis or reassembly process. In a preferred embodiment, the synthetic gene is designed according to a method illustrated as method 300 in
Step 302. The synthetic gene is divided as follows. If the synthetic gene is very large, the DNA sequence is optionally divided into two or more large pieces of DNA of roughly equal size. Adjacent pieces of DNA preferably overlap by a number of nucleotides appropriate to facilitate reassembly. The extent of overlap depends on factors including the particular base sequence, method of reassembly, temperature, and salt concentration, and may be determined by the skilled artisan without undue experimentation. The adjacent pieces of DNA are designed for reassembly by any method or combination of methods known in the art for joining DNA molecules, for example, by ligation or by overlap extension. In some embodiments, the division is optimized to produce pieces of DNA that are more likely to assemble into the desired DNA sequence, as described in greater detail below.
In a preferred embodiment, each large piece is dsDNA and overlaps the adjacent large piece by at least the width of a restriction site (typically from about four to about six bases). Preferably, a restriction site that does not appear elsewhere in either piece is engineered into the DNA sequence of each resulting piece by exploiting the degeneracy of the genetic code as described in greater detail below. Adjacent large pieces are reassembled by cutting with the appropriate restriction enzyme, annealing the adjacent pieces together, and ligating the cut ends together.
In other embodiments, the large pieces of ssDNA or dsDNA are designed such that adjacent pieces of DNA on the same strand abut without a gap, as illustrated in
In another preferred embodiment, each large piece is ssDNA or dsDNA and directly abuts the adjacent large piece. In this embodiment, primers are then constructed that overlap the abutting ends of the large pieces by from about 25 to about 30 bases or greater, and the abutting large pieces are reassembled by overlap extension.
In yet another preferred embodiment, each large piece is dsDNA with overlapping regions comprising from about 25 to about 30 bases or greater of complementary ssDNA. In this embodiment, the adjacent large pieces are reassembled by hybridization and ligation. In another embodiment, the ssDNA overlaps are designed to leave single-stranded regions flanking the double-stranded overlap region. In this embodiment, the adjacent large pieces are reassembled by overlap extension.
In still another embodiment, each resulting piece is ssDNA and the overlapping regions of adjacent resulting pieces comprise from about 25 to about 33 bases or greater of complementary single-stranded DNA. In some preferred embodiments, the overlapping regions comprise about 75 bases. In this embodiment, the adjacent resulting pieces are reassembled by overlap extension.
In certain embodiments, the large or very large pieces of DNA may be reassembled by cloning using a vector of any type known in the art. Examples of suitable vectors include without limitation, plasmids, cosmids, phagemids, viruses, chromosomes, bacterial artificial chromosomes (BAC), or synthetic chromosomes.
Embodiments using ligation or cloning are preferred in situations in which using DNA polymerase is disfavored, for example, where the DNA is longer than about 3 kb to about 5 kb.
The method for designing a piece of DNA is described for a large piece of DNA, but is applicable to pieces of DNA of any size. The piece of DNA is divided into overlapping short pieces of DNA that are readily available—that is, small pieces or segments. Preferably, each short piece is small enough to be synthesized readily. This large piece of DNA could be the target DNA sequence or a large piece of DNA derived from the division of a very large piece of DNA as described above.
In a preferred embodiment, the large piece of DNA is designed for “direct self-assembly.” In this embodiment, each large piece is divided into from about 50 to about 60 overlapping small pieces of from about 50 to about 60 bases or fewer. Preferably, the adjacent small pieces of DNA from the same strand abut, i.e., hybridize to form a DNA construct with no gaps between the pieces. In this embodiment, the large piece of DNA is preferably reassembled by ligation (“direct self-assembly and ligation”). In an embodiment in which adjacent small pieces from the same strand do not abut, i.e., hybridize to form a DNA construct with single-stranded gaps between the double-stranded overlaps, the large piece of DNA is preferably reassembled by overlap extension. In another embodiment, adjacent small pieces from the same strand abut, i.e., hybridize to form a DNA construct with no single-stranded gaps between the double-stranded overlaps, and the large piece of DNA is reassembled by overlap extension. In another embodiment, a DNA construct with a combination of gaps and no gaps is reassembled by overlap extension. In another preferred embodiment, the large piece of DNA is reassembled by cloning in an expression vector. In this embodiment, the ends of the DNA construct may have any combination of gaps and no gaps. Preferably, the ends of the large piece of DNA are adapted for insertion into an expression vector, for example, complementary to a restriction site in the expression vector.
In another preferred embodiment, the large piece of DNA is designed for “recursive assembly” or “hierarchical assembly.” In this embodiment, a large piece of DNA is divided first into overlapping medium-sized pieces of DNA. In some embodiments, the large piece of DNA is divided into about three to about ten medium-sized pieces of DNA, preferably, about five to about seven pieces. Each medium-sized piece is then subdivided into overlapping small pieces of DNA, preferably, from about six to about twelve pieces. As described above for direct self-assembly, the DNA pieces at each level of recursion may be designed for reassembly by any combination of methods, including ligation, overlap extension, or cloning. In a preferred embodiment, the DNA pieces are reassembled by overlap extension. One skilled in the art will realize that direct self-assembly is a special case of recursive assembly in which the large piece of DNA is divided in a single step.
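By way of illustration only, the recursive division may be sketched as follows. The sketch assumes a fixed number of pieces per division and a fixed overlap at every level, whereas the embodiments above permit these to vary and to be optimized; the function names, the random test sequence, and the parameter values are hypothetical.

```python
import random

def split_overlapping(seq, n_pieces, overlap):
    """Split seq into n_pieces of roughly equal size; adjacent pieces
    share exactly `overlap` bases."""
    step = (len(seq) - overlap) / n_pieces   # new bases contributed per piece
    pieces = []
    for i in range(n_pieces):
        start = round(i * step)
        if i == n_pieces - 1:
            end = len(seq)
        else:
            end = round((i + 1) * step) + overlap
        pieces.append(seq[start:end])
    return pieces

def divide(seq, max_len, fanout, overlap):
    """Divide recursively until every piece is at most max_len bases.
    Returns a nested list whose leaves are the small pieces."""
    if len(seq) <= max_len:
        return seq
    return [divide(piece, max_len, fanout, overlap)
            for piece in split_overlapping(seq, fanout, overlap)]

def leaves(tree):
    """Flatten the nested division into the list of small pieces."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

rng = random.Random(0)                       # fixed seed, illustrative sequence
gene = "".join(rng.choice("ACGT") for _ in range(1000))
tree = divide(gene, max_len=60, fanout=5, overlap=20)
small = leaves(tree)
print(len(tree), "medium pieces;", len(small), "small pieces;",
      "longest small piece:", max(len(s) for s in small))
```

With these hypothetical values, a 1000-base sequence is divided in two levels, and passing the whole sequence with a single level of division would correspond to the direct self-assembly special case.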
Step 304. For the resulting pieces of DNA at each level of recursion, the sequences are optimized to strengthen correct matches (the overlap regions between adjacent pieces of DNA) and to disrupt incorrect matches (all other hybridizations). For example, a DNA sequence in a coding region may be optimized by taking advantage of the genetic code degeneracy. A DNA sequence in a regulatory region may be optimized by taking advantage of the degeneracy in the regulatory region consensus sequence. A DNA sequence outside a coding or regulatory region, i.e., in an intergenic region, may be optimized by direct base assignment.
Some embodiments use no or limited sequence optimization. For example, changes in a nucleotide sequence can change the secondary structure of DNA and RNA. Changes in the secondary structure in RNA viral genomes can affect the viability of the viruses. In some embodiments, no sequence optimization is performed. In other embodiments, selected sequences are optimized as described above and other sequences are not.
In some embodiments, the division described in step 302 is optimized to increase the probability that the pieces of DNA will reassemble into the desired DNA sequence. The boundary points between adjacent pieces of DNA are adjusted to create or to increase a temperature gap, or to disrupt other incorrect hybridizations, for example, hairpins.
Reassembly or synthesis of the synthetic gene is the formal reverse of the division process described in steps 302 and 304.
Step 306. Obtain the optimized small pieces of DNA. Typically, the small pieces of DNA are synthetic. In a preferred embodiment, the small pieces of DNA are single-stranded and overlapping portions of adjacent pieces are complementary.
Step 308. Combine the pieces of DNA derived from one division of each piece of DNA. In a recursive assembly process, the small pieces derived from each medium-sized piece are combined in this step. In the next recursive assembly cycle, the resulting medium-sized pieces derived from each large piece are combined in this step, and so on and so forth. In a direct self-assembly process, the small pieces derived from the large piece of DNA are combined in this step.
Step 310. Allow the DNA segments to self-assemble to form a DNA construct of ssDNA segments connected by double-stranded overlap regions. In embodiments in which pieces of DNA are double-stranded, the pieces are preferably first denatured. Embodiments using overlap extension to reassemble a piece of DNA have single-stranded gaps between the double-stranded overlap regions. Preferably, the single-stranded gaps are from about zero to about 20 bases long. Embodiments using ligation to reassemble a piece of DNA have single-stranded gaps of length zero (i.e., no gap, a nick in the DNA) and the double-stranded overlap regions abut each other. Embodiments using cloning to reassemble a piece of DNA have any combination of gaps and no gaps.
Step 312. Extend the DNA construct to full-duplex dsDNA. In embodiments with single-stranded gaps between the double-stranded overlaps, extension is accomplished using overlap extension, preferably, using a high-fidelity DNA polymerase reaction. In embodiments with no gaps between the double-stranded overlaps, extension is accomplished by ligation. In another preferred embodiment, the self-assembled construct is cloned into an expression vector, and the extension to full-duplex dsDNA is performed by the cellular machinery.
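By way of illustration only, the relationship between single-stranded gaps and the extension technique of step 312 may be sketched as follows. The sketch assumes that each segment is described by hypothetical start and end coordinates on its strand, and it returns only the default pairing stated in step 312; other embodiments permit overlap extension or cloning for constructs without gaps.

```python
def strand_gaps(segments):
    """Gap lengths between consecutive segments on one strand.
    segments: sorted (start, end) intervals, end exclusive."""
    return [nxt[0] - cur[1] for cur, nxt in zip(segments, segments[1:])]

def extension_method(top, bottom):
    """Pick the step 312 technique from the gaps on both strands."""
    gaps = strand_gaps(top) + strand_gaps(bottom)
    if any(g < 0 for g in gaps):
        raise ValueError("segments on the same strand must not overlap")
    # Gaps of length zero are nicks, which ligation can seal directly.
    return "ligation" if all(g == 0 for g in gaps) else "overlap extension"

# Hypothetical coordinates: abutting segments on both strands (nicks only).
nicked = extension_method(top=[(0, 40), (40, 80)], bottom=[(20, 60), (60, 100)])
# Hypothetical coordinates: a 10-base single-stranded gap on the top strand.
gapped = extension_method(top=[(0, 40), (50, 90)], bottom=[(25, 65)])
print(nicked, "|", gapped)
```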
Some embodiments use ssDNA in subsequent steps. ssDNA is produced from the dsDNA using any method known in the art, for example, by denaturing or using nicking enzymes. In some embodiments, the DNA is cloned into a vector that produces ssDNA, for example, bacteriophage M13 or a plasmid containing the M13 origin of DNA replication. M13 is known to roll-off ssDNA into the medium.
Step 314. In some embodiments, a property indicative of the likelihood of correctness of the resulting piece of DNA is optionally determined as disclosed in greater detail below. Pieces of DNA likely to have the correct sequence are selected for subsequent reassembly steps. In some embodiments, the property is determined after the synthetic gene is fully reassembled in order to select a synthetic gene likely to have the correct sequence. In some embodiments, the sequence of the selected synthetic gene is confirmed by sequencing.
Step 316. Repeat steps 308-314 in reverse order of the division in step 302 to produce the synthetic gene. Those skilled in the art will appreciate that the pieces of DNA identified in the division process may each be synthesized by a different reassembly method. For example, one piece may be synthesized by overlap extension, while a second is synthesized by ligation, and these two pieces reassembled by cloning into an expression vector.
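By way of illustration only, the reverse-order reassembly may be modeled on plain strings as follows. The sketch assumes that each piece is given as the sequence of one strand in a common orientation and that adjacent pieces are joined on their longest shared end; it does not model hybridization, polymerase extension, or ligation chemistry. The target sequence, the piece boundaries, and the function names are hypothetical.

```python
def join_by_overlap(left, right, min_overlap):
    """Join two pieces on their longest suffix/prefix overlap."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    raise ValueError("no overlap of the required length")

def reassemble(tree, min_overlap):
    """Assemble the smallest pieces first, then each next larger level."""
    if isinstance(tree, str):
        return tree
    parts = [reassemble(child, min_overlap) for child in tree]
    result = parts[0]
    for part in parts[1:]:
        result = join_by_overlap(result, part, min_overlap)
    return result

target = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCC"
# Two medium pieces, each divided into two small pieces; overlaps of 6 bases.
tree = [
    [target[0:14], target[8:22]],
    [target[16:30], target[24:40]],
]
product = reassemble(tree, min_overlap=6)
print(product == target)
```

Joining on the longest matching end is a simplification: in a repetitive sequence a longer spurious match could be chosen, which is the string analogue of the incorrect hybridizations that the sequence optimization is intended to suppress.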
In a preferred embodiment, the disclosed method takes advantage of the fact that the genetic code is sufficiently degenerate to allow codons to be assigned so that, with high probability, wrong hybridizations melt at lower temperatures and correct hybridizations melt at higher temperatures. Consequently, there is an intermediate temperature range within which, with high probability, the product that does form is mostly correct. Because errors occur with low probability, two or more compensating errors that yield a product with the correct molecular weight—i.e., the same band in the final gel—or two or more compensating deletions that yield a product of the same reading frame—i.e., the same or nearly the same encoded amino acid sequence—would correspond to a doubly rare or rarer event.
The final primers, or intermediate primers, or other flanking sequences may contain sequences that allow the synthesized gene to be inserted into an expression vector using various well-known cloning methods, for example, restriction enzymes, homologous recombination, exonuclease cloning, or other methods known in the art, allowing one to build a large synthetic gene quickly and easily.
A problem in some embodiments of the disclosed method is that a synthetic oligonucleotide, or small piece of DNA, typically contains a mixture of the desired DNA sequence (“full-length oligonucleotide”) contaminated with sequences with internal point deletions. This problem is referred to herein as the “N−1” problem because oligonucleotides with a single point deletion (“N−1 oligonucleotides”) are the most common contaminant in a typical chemical synthesis of oligonucleotides. Furthermore, in some embodiments, the N−1 oligonucleotides are the most problematic because they are more likely to hybridize, and consequently, to provide undesired products, than oligonucleotides with more than one point deletion or mutation. When this mixture of oligonucleotides is used to synthesize medium and large pieces of DNA as disclosed herein, the product pieces of DNA contain a population containing DNA with the desired sequence as well as DNA with errors arising from incorporation of the N−1 oligonucleotides. The N−1 oligonucleotide errors are cumulative and may cause frame-shift mutations, as understood by those skilled in the art.
In the chemically synthesized oligonucleotides, the typical coupling efficiency for each nucleotide is from about 98% to about 99.5%, or greater. TABLE I provides the yield of the desired full-length oligonucleotide of length 20 to 250 nt for coupling efficiencies of 99.5%, 99%, and 98%. These results are provided graphically in
Even in cases in which the desired synthetic gene is synthesized with high probability of correct oligonucleotide order, the desired gene is invariably mixed with many defective genes arising from N−1 oligonucleotides. In many applications, this mixture of correct and defective genes is undesirable. Accordingly, disclosed below is a method for improving the probability of synthesizing the desired gene and/or selecting the desired gene from this mixture.
In some embodiments, the N−1 problem is addressed by assembling the chemically synthesized oligonucleotides using direct self-assembly and ligation, as described above and illustrated in
In some embodiments, the N−1 problem is addressed by sampling the population of synthetic DNA molecules and sequencing the sampled molecules. In some embodiments, a random sample from the population of different DNA molecules produced in any of the reassembly steps, including the final step, is sequenced and only those molecules with the correct nucleotide sequence are used in the next reassembly step. The optimum sample size is related to the probability of synthesizing the desired DNA molecule. For example, a synthesis of a 200-nt oligonucleotide or intermediate fragment with a 99.5% coupling efficiency provides about 37% of the correct oligonucleotide. Randomly selecting four oligonucleotides or intermediate fragments from the product mixture provides about an 84% chance of selecting at least one correct oligonucleotide. For a 300 nt oligonucleotide or intermediate fragment at 99.5% coupling efficiency, the correct oligonucleotide makes up about 22% of the product. The probability of selecting at least one correct oligonucleotide or intermediate fragment from a sample of four oligonucleotides from this mixture is about 63%. The probabilities of selecting at least one correct oligonucleotide or intermediate fragment using sample sizes of 1, 4, 6, and 8 for syntheses with coupling efficiencies of 99.5% and 99.7% and oligonucleotide lengths of 250 nt, 300 nt, and 300 nt are provided in TABLE II. As shown in TABLE II, only a modest amount of sequencing is necessary to provide a good probability of selecting a correct oligonucleotide or intermediate fragment.
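By way of illustration only, the percentages recited above can be reproduced with the short calculation below. The sketch assumes one coupling step per nucleotide and independent random draws from the product mixture, which is consistent with the figures in the text but is stated here as an assumption; the function names are hypothetical.

```python
def full_length_yield(length_nt, coupling_efficiency):
    """Fraction of full-length product, assuming one coupling per nucleotide."""
    return coupling_efficiency ** length_nt

def p_at_least_one_correct(fraction_correct, sample_size):
    """Chance that a random sample contains at least one correct molecule."""
    return 1.0 - (1.0 - fraction_correct) ** sample_size

for length in (200, 300):
    y = full_length_yield(length, 0.995)
    p4 = p_at_least_one_correct(y, 4)
    print(f"{length} nt: yield {y:.0%}, P(at least one correct in 4) {p4:.0%}")
# prints:
# 200 nt: yield 37%, P(at least one correct in 4) 84%
# 300 nt: yield 22%, P(at least one correct in 4) 63%
```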
In some embodiments, sampling is performed by cloning the DNA-to-be-sequenced into a suitable vector. Typically, each transformed colony corresponds to one molecule of the synthetic DNA. In some embodiments, a sample of transformed colonies is selected, the DNA sequenced, and DNA with the correct sequence is used in the next hierarchical stage of assembly. The cloning is any type of cloning known in the art. In one embodiment, the cloning is topoisomerase I (TOPO®, Invitrogen) cloning.
The sampling is performed at any of the hierarchical reassembly stages. For example, in some embodiments, an oligonucleotide is sequenced after chemical synthesis. In some embodiments, oligonucleotides or intermediate fragments are assembled into a medium-sized piece of DNA, which is then sequenced. In some embodiments, medium-sized pieces of DNA are assembled into a large piece of DNA, which is then sequenced. In some embodiments, the sampling and sequencing are performed on the medium- or large-sized pieces of DNA, which are synthesized by direct self-assembly and ligation.
In some embodiments, the N−1 problem is addressed by analyzing the polypeptide(s) expressed from a sample from the population of synthetic DNA sequences. The DNA is expressed using any means known in the art, for example, inserting the gene in an expression vector or using a cell-free expression system. In some embodiments, an organism is transformed by electroporation or using a gene gun. In some embodiments, the DNA sequence is cloned in an expression vector and expressed. As discussed above, each clone typically corresponds to one DNA molecule from the population. In some embodiments, the DNA is the full-length synthetic gene. In other embodiments, the DNA is an intermediate fragment. In the case of an intermediate fragment, those skilled in the art will realize that, in some embodiments, the intermediate fragment is designed with (1) a leader that provides a start codon in the correct reading frame, that is, provides an ATG in the DNA and a 0-2 nt filler that adjusts the reading frame in order to express the desired polypeptide, and (2) a trailer that provides one or more stop codons (TAA, TAG, or TGA) in the DNA and a 0-2 nt filler that adjusts the reading frame in order to terminate the desired polypeptide. Typically, the reading frame is the same as for the full-length synthetic gene, although other reading frames are used in some embodiments. Typically, from zero to two bases are inserted into the leader and trailer for adjusting the reading frame. Those skilled in the art will recognize that more than two bases could be used to adjust the reading frame. For example, in some embodiments, the leader and/or trailer encodes additional amino acids, restriction sites, or control sequences. Those skilled in the art will further realize that, in some embodiments, different leaders and/or trailers are used in conjunction with the same piece of DNA in different steps of the method. 
For example, in some embodiments, the leader and/or trailer used in the expression of a polypeptide from a piece of DNA is different from the leader and/or trailer used in the assembly of that piece of DNA. Some embodiments provide one or more stop codons downstream (3′) of the gene in order to stop the translation of DNA fragments constructed from one or more N−1 oligonucleotides. In some embodiments, the stop codons are engineered into the expression vector. Some embodiments include at least three stop codons downstream (3′) of the gene, at least one in each of the three possible reading frames. Some embodiments use groups of stop codons instead of single stop codons in each reading frame.
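By way of illustration only, a leader and trailer for expressing an intermediate fragment in the reading frame of the full-length gene may be constructed as sketched below. The sketch assumes that the position of the fragment within the gene is known, counted from the first base of the first codon; the filler base, the particular three-frame stop block, the example sequence, and the function names are hypothetical choices.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
# Hypothetical trailer block placing one TAA in each of the three frames.
THREE_FRAME_STOPS = "TAACTAACTAA"

def expression_construct(fragment, gene_offset, filler_base="C"):
    """Wrap gene[gene_offset:gene_offset + len(fragment)] for expression.
    A 0-2 nt filler after ATG restores the gene's reading frame, and a
    0-2 nt filler before the stops completes the final codon."""
    lead_fill = filler_base * (gene_offset % 3)
    end = gene_offset + len(fragment)
    trail_fill = filler_base * (-end % 3)
    return START + lead_fill + fragment + trail_fill + THREE_FRAME_STOPS

def has_stop_in_every_frame(seq):
    """True if each of the three reading frames of seq contains a stop codon."""
    return all(
        any(seq[i:i + 3] in STOPS for i in range(frame, len(seq) - 2, 3))
        for frame in range(3)
    )

gene = "ATGGCTAGCAAAGGAGAAGAACTT"       # arbitrary illustrative coding sequence
fragment = gene[4:14]                   # fragment starting mid-codon
construct = expression_construct(fragment, 4)
print(construct, has_stop_in_every_frame(THREE_FRAME_STOPS))
```

The filler base C is chosen in this sketch because no stop codon begins or ends with C, so the filler cannot itself complete a stop codon with the adjacent fragment bases.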
A polypeptide expressed from a clone with an N−1 defect will be defective. The expressed polypeptide is analyzed using any means known in the art, for example, gel electrophoresis, capillary electrophoresis, two-dimensional electrophoresis, isoelectric focusing, spectroscopy, mass spectroscopy, NMR spectroscopy, chemical analysis, ligand binding, enzymatic cleavage, or a functional or immunological assay. A clone that expresses the correct polypeptide is free from N−1 defects.
In some embodiments, the expressed polypeptide is analyzed using gel electrophoresis, which separates polypeptides by molecular weight. Of the 64 possible DNA codons, 3 are stop codons. Consequently, the frame-shift caused by a point deletion is likely to generate a new stop codon, resulting in a prematurely truncated polypeptide, the molecular weight of which is determined using gel electrophoresis. A clone that provides a full-length polypeptide is likely to have the desired sequence, while one that provides a truncated polypeptide is likely to have at least one point deletion. In some embodiments, a clone with an N−1 defect or defects produces a polypeptide that is too long, because the N−1 defect results in a frame-shift that causes the terminating stop codon(s) to be ignored (read through). In some embodiments, such a polypeptide that is too long will be terminated by a stop codon engineered into the expression vector downstream (3′) of the gene. As discussed above, some embodiments comprise three groups of stop codons, one group in each possible reading frame. In these embodiments, the molecular weight of the expressed polypeptide is higher than expected.
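By way of illustration only, the effect of a single point deletion on the length of the expressed polypeptide may be sketched as follows. The example sequence is hypothetical and was chosen so that the frame-shift exposes a stop codon; the estimate of the expected read length assumes uniformly random codons, which real coding sequences are not.

```python
STOPS = {"TAA", "TAG", "TGA"}

def codons_before_stop(orf):
    """Number of codons translated before the first in-frame stop codon
    (or before the end of the sequence if no stop is found)."""
    for i in range(0, len(orf) - 2, 3):
        if orf[i:i + 3] in STOPS:
            return i // 3
    return len(orf) // 3

def delete_base(seq, pos):
    """Model an N-1 defect: remove the base at index pos."""
    return seq[:pos] + seq[pos + 1:]

correct = "ATGGCATTGAAGCTTAGGTCTTAA"    # ATG GCA TTG AAG CTT AGG TCT TAA
defective = delete_base(correct, 4)     # ATG GAT TGA ... after the deletion

print(codons_before_stop(correct), codons_before_stop(defective))  # prints: 7 2

# Assumption: with uniformly random codons, 3 of 64 are stops, so a shifted
# frame is read for about 61/3 (roughly 20) codons before a stop appears.
p_stop = 3 / 64
expected_codons = (1 - p_stop) / p_stop
```

In this example the correct sequence encodes seven residues, while the deletion truncates the product to two, a difference that is readily resolved by molecular weight.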
In some embodiments, analysis of the expressed polypeptide is used to narrow the sample of clones that are then sequenced. In these embodiments, the analysis of the expressed polypeptide is used to identify and to eliminate clearly defective (e.g., truncated or too long) DNA clones. The remaining clones are then sequenced. In these embodiments, the expression and analysis is a semi- or nonrandom selection method, in contrast to the random selection method described above. In some embodiments, the expressed polypeptide is analyzed by gel electrophoresis. In some cases, gel electrophoresis does not distinguish a defective polypeptide from the correct polypeptide. For example, in some cases a DNA sequence with an N−1 defect generates a defective polypeptide that, to within the resolution of the electrophoresis conditions, has the same molecular weight as the correct polypeptide. This scenario can arise where the defective DNA sequence fortuitously expresses a defective polypeptide similar in molecular weight to the correct polypeptide, for example, where the point defect is near the end of the clone. In another scenario, the clone has 3N point deletions that do not generate a new stop codon. As discussed above, the defective polypeptide is most likely shorter than the correct polypeptide. A defective polypeptide closer in molecular weight to the correct polypeptide than the resolution of the electrophoresis experiment is not distinguished. Given the resolving ability of gel electrophoresis, selecting a correct clone using the method is highly probable. The probability is further improved using an analytical technique with higher resolution, for example, capillary electrophoresis or mass spectroscopy. In some cases, all of the clones selected for sequencing in the gene expression screen have the correct sequence, indicating the reliability of this selection method. 
Furthermore, expressing a gene and determining the molecular weight of the expressed polypeptide is typically faster and/or less expensive than the equivalent amount of DNA sequencing. In some embodiments, intermediate fragments are selected solely by estimating the molecular weight of the expressed polypeptide, and DNA sequencing is reserved for the final gene construct, and even then only after the molecular weight of a polypeptide expressed from the final gene has been estimated to be correct.
In some embodiments, all of the expressed polypeptides that are analyzed are defective, for example, truncated. In these embodiments, an analysis of the defective polypeptides indicates the location of the defect in the DNA sequence. The gene is then resynthesized using this information. In embodiments using multiple hierarchical synthesis steps, only some of the pieces of DNA are resynthesized, for example, an intermediate fragment containing the defect. In some embodiments, the offending fragment is divided in a different way and/or reoptimized, as discussed above. In some embodiments a different clone is chosen to replace the offending fragment.
In some embodiments, the DNA sequence is transformed into a selected organism such that a clone containing a correct and/or likely-to-be correct DNA sequence will exhibit a phenotype different from the phenotype exhibited by a clone containing an incorrect DNA sequence. The organism is a prokaryote or a eukaryote. Examples of suitable prokaryotes include bacteria, for example, E. coli. Examples of suitable eukaryotes include yeast, fungi, and mammalian cells. The differences in phenotype arise from mechanisms well known in the art, for example, differential induction and/or repression of gene expression by correct and incorrect DNA sequences, expression of different proteins or polypeptides by the correct and incorrect DNA sequences, and the like. In some embodiments, the difference in phenotype is detectable without specialized equipment, for example, by inspection by the naked eye. For example, in some embodiments, the difference in phenotype is the color of the organism. In other embodiments, the difference in phenotype is viability of the organism under particular conditions. Examples of particular conditions include pH, temperature, light, and the like. In some embodiments, conditions include the presence or absence of a particular compound or compounds, including, nutrients, for example, amino acids, carbohydrates, vitamins, cofactors and the like; and/or antibiotics or other toxic compounds. Those skilled in the art will understand that other particular conditions are compatible with the disclosed method. Those skilled in the art will understand that the embodiments described below are exemplary only, and that the method may be varied to use, for example, other organisms, phenotypes, conditions, genes, and vectors.
Some embodiments of the method use a vector referred to herein as a “frameshifted vector” to distinguish between correct and incorrect DNA sequences. The frameshifted vector is any type of vector known in the art useful for introducing a gene into an organism, for example, a plasmid, a cosmid, a phagemid, a bacteriophage, a virus, and/or a bacterium. The frameshifted vector contains a gene that, when expressed, changes the phenotype of a selected organism into which the vector is transformed, for example, color or viability. In the frameshifted vector, a frameshift is introduced into at least a portion of the open reading frame (ORF) of the gene such that the gene does not express a functional product. In some embodiments, the frameshift is introduced upstream of a functional portion of the gene. A functional portion of the gene is a portion that expresses a functional polypeptide or protein, the expression of which changes the phenotype of the organism. When an organism is transformed using the frameshifted vector, no functional product is expressed, and consequently, no change in phenotype is observed. The term “frameshift” as used herein is used in its usual sense, as well as to mean the insertion and/or deletion of one or more bases, which results in a change in reading frame. The three possible reading frames for the gene are referred to herein as the correct reading frame; the +1 or n+1 reading frame; and the −1 or n−1 reading frame. In a frameshifted vector, at least a portion of the ORF is in the −1 or +1 reading frame. Those skilled in the art will understand that for any particular vector/organism system, two frameshifted vectors, one with a −1 reading frame and one with a +1 reading frame, are sufficient to perform the disclosed method with any synthetic DNA sequence designed to shift the downstream reading frame.
In some embodiments, the frameshifted vector also includes at least one DNA insertion site upstream of the region of the gene that encodes the functional portion of the protein or peptide. The DNA insertion site is of any type known in the art useful for inserting a piece of DNA into the frameshifted vector. In some embodiments, the DNA insertion site is one or more restriction sites. Those skilled in the art will understand that any compatible restriction site known in the art is suitable. Examples of suitable restriction sites include EcoR I, BamH I, Hind III, Pci I, Age I, Spe I, Nde I, Nco I, Sac I, Sac II, Pvu I, Xho I, Pst I, and Sph I. Some embodiments of the frameshifted vector comprise a plurality of DNA insertion sites.
The synthetic DNA sequence is designed to correct the frameshift when inserted into the DNA insertion site. In other words, the DNA sequence is designed with a length that corrects the −1 or +1 shift designed into the reading frame of the functional portion of the gene. A piece of DNA with the desired sequence is also referred to herein as having a “correct” DNA sequence. Consequently, the functional portion of the gene is in the correct reading frame when a correct DNA sequence is inserted therein. On transforming the selected organism with the resulting vector, the organism expresses functional polypeptide or protein, which changes the phenotype of the organism.
In contrast, when a DNA sequence with an N−1 defect is inserted into the DNA insertion site, the frameshift is not corrected. For example, for a vector with a −1 frameshift, inserting an N−1 DNA sequence produces a +1 frameshift. Inserting a DNA sequence with two N−1 defects also does not correct the frameshift. For a vector with a −1 frameshift, inserting a DNA sequence with two N−1 defects produces a −1 frameshift. Inserting a DNA sequence with three N−1 defects corrects the frameshift. In general, a DNA sequence with 3n N−1 defects will correct the frameshift, while those with 3n+1 or 3n+2 N−1 defects will not. Given the low error rates for incorporation of N−1 oligonucleotides in the synthesis of the next-larger piece of DNA provided above, especially for ligation-based methods, the probability that a DNA sequence will have three or more N−1 defects is low, although not negligible. Consequently, most of the DNA sequences with the correct reading frame in the frameshifted vector have the correct sequence. In some embodiments for selecting intermediate fragments, about 80% to about 95% of clones exhibiting the changed phenotype have the correct DNA sequence. The remainder have three or more N−1 defects, which is consistent with the error rates provided above.
Similarly, when an organism is transformed with a frameshifted vector into which no DNA is inserted, the frameshift engineered into the vector causes no functional product to be expressed, and consequently, no change in phenotype. Of the four most likely DNA inserts into the frameshifted vector—no DNA, DNA with one N−1 defect, DNA with two N−1 defects, and correct DNA—only the vector with the correct DNA inserted therein changes the phenotype of an organism transformed therewith. An organism exhibiting the changed phenotype is selected, the correct DNA sequence isolated, and the DNA sequence used as described herein. The correct DNA sequence is isolated by any method known in the art, for example, by excision or by PCR using suitable primers.
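The selection logic of the frameshifted vector reduces to counting point deletions modulo three. The following is a minimal sketch of that logic; the function names are hypothetical, and the model considers only N−1 point deletions in an insert whose defect-free length corrects the frameshift engineered into the vector.

```python
def frame_restored(n_point_deletions):
    """An insert designed to correct the vector's frameshift still does so
    only if the number of single-base deletions is a multiple of three."""
    return n_point_deletions % 3 == 0

def shows_changed_phenotype(has_insert, n_point_deletions=0):
    """Predict whether a transformed organism exhibits the changed phenotype.
    A vector without an insert keeps its engineered frameshift."""
    return has_insert and frame_restored(n_point_deletions)

if __name__ == "__main__":
    cases = [
        ("no DNA", False, 0),
        ("one N-1 defect", True, 1),
        ("two N-1 defects", True, 2),
        ("correct DNA", True, 0),
        ("three N-1 defects (false positive)", True, 3),
    ]
    for label, has_insert, n in cases:
        print(label, shows_changed_phenotype(has_insert, n))
```

The last case shows why a minority of clones exhibiting the changed phenotype can still carry defects: an insert with 3n deletions restores the reading frame without having the correct sequence.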
A frameshifted vector is synthesized by any method known in the art. Some embodiments use known combinations of a particular vector and organism such that the organism changes phenotype when transformed with the vector. Typically, the vector has an open reading frame containing a functional portion of a gene, the expression of which changes the phenotype of the organism. One or more bases are inserted and/or deleted in the ORF upstream of the functional portion of the gene, thereby causing a −1 or +1 frameshift in the functional portion of the gene. The bases are inserted and/or deleted by methods known in the art, for example, cutting with restriction enzymes, digestion of double- or single-stranded portions, site-directed mutagenesis, ligation, chemical synthesis, and the like. In some embodiments, a DNA insertion site is also engineered upstream of the functional portion of the gene. In other embodiments, the ORF contains a preexisting DNA insertion site upstream of the functional portion of the gene.
Some embodiments distinguish a correct DNA sequence from an incorrect DNA sequence by the color of the transformed organism. An embodiment of the method uses a vector with a gene encoding the α-complementing fragment (α-fragment) of E. coli lacZ β-galactosidase. The DNA sequence is inserted at the DNA insertion site located upstream of the functional portion of the gene for the α-complementing fragment. The vector is engineered so that functionality of the α-fragment gene depends on the reading frame of the functional portion of the gene after the synthetic DNA sequence is inserted into the DNA insertion site. The vector is transformed into an E. coli strain containing a 5′-truncation of the lacZ gene. Protein expressed by a functional α-fragment gene transcomplements the defective lacZ expressed by the cell, thereby producing functional β-galactosidase. When the cells are grown on indicator media containing isopropylthio-β-D-galactoside (IPTG) and 5-bromo-4-chloro-3-indolyl-β-D-galactoside (X-Gal), colonies developing from cells with a functional α-complementing fragment gene are blue, while those with a defective α-complementing fragment gene are white.
In one embodiment, the frameshifted vector is a modified pGEM®-3Z vector (Promega Corp., Madison, Wis.). The pGEM®-3Z vector is a pBR322-based plasmid that contains a multiple cloning site (MCS) in the ORF of the α-fragment gene, as well as an ampicillin resistance gene. The vector is engineered with a frameshift mutation in the α-fragment gene, which renders the gene non-functional. The DNA sequence is designed to correct the frameshift when inserted at the MCS, thereby producing a functional α-fragment gene. Colonies of cells transformed with the frameshifted vector are white. White colonies are also observed for cells transformed with a frameshifted vector into which a DNA sequence with one or two N−1 defects is inserted. Blue colonies are observed only for cells transformed with a frameshifted vector into which a DNA sequence with no defects is inserted. A feature of this system is that the α-fragment is known to retain its activity with N-terminal extensions of up to 650 amino acids.
In some embodiments, the difference in phenotype is temperature resistance. Some embodiments use E. coli AB4141 (F−, metC56, lct-1, thi-1, valS7, ara-14, lacY1, galK2, xyl-7, rpsL69, tfr-5, supE44), which contains a conditionally lethal, temperature-sensitive valS (valyl-tRNA synthetase). This strain grows at a permissive temperature of about 37° C., but not at a restrictive temperature of about 42° C. After transformation with a plasmid expressing wild-type valS, the strain grows at the restrictive temperature. One embodiment uses a frameshifted vector derived from the plasmid pDH-1Δ11, which includes a wild-type valS gene. In this system, any colony growing at the restrictive temperature has the correct DNA sequence. Cells without the correct DNA sequence do not grow at all at the restrictive temperature.
Some embodiments provide a kit comprising one or more frameshifted vectors and instructions for using the frameshifted vector(s) to isolate a DNA sequence as described herein. Some embodiments of the kit also include other components, for example, a preselected organism, a growth medium, a restriction enzyme, and the like.
The method described herein provides a quick, easy, and inexpensive method for constructing a synthetic DNA gene that encodes any desired protein or any other desired nucleic acid. A first example provided herein describes a two-step recursive assembly of a gene encoding E. coli threonine deaminase, a protein with 514 amino acid residues (1,542 coding bases). A second example describes a three-step recursive assembly of a gene encoding the smallpox (variola) DNA polymerase, a protein with 1,005 amino acid residues (3,015 coding bases). A third example describes a direct self-assembly of a synthetic E. coli threonine deaminase gene, with reassembly by cloning into an expression vector. A fourth example describes a two-step recursive assembly with sampling and sequencing of the 876 bp Ty3 GAG3 gene, which encodes the Gag3p polyprotein. Ty3 is a retrotransposon in Saccharomyces cerevisiae. A fifth example describes a two-step recursive assembly with sampling and sequencing for the 1640 bp Ty3 integrase (“Ty3 IN”) gene. Accordingly, it will be appreciated that the method described herein may be used to construct any desired nucleic acid sequence or any desired gene.
E. coli threonine deaminase was chosen arbitrarily because (1) its size is comparable to most proteins, demonstrating wide applicability, (2) it assembles into a homo-tetramer, demonstrating protein-protein interactions, (3) its allosteric properties easily can be assessed for correct folding and assembly, and (4) in part on a whim, for old times' sake, because its structure and properties were the Ph.D. thesis project of one of us. See, Hatfield & Ray “Coupling of slow processes to steady state reactions” J. Biol. Chem., 1970, 245(7), 1753-4; Hatfield, Ray, & Umbarger “Threonine deaminase from Bacillus subtilis. III. Pre-steady state kinetic properties.” J. Biol. Chem., 1970, 245(7), 1748-53; Hatfield & Umbarger “Threonine deaminase from Bacillus subtilis. II. The steady state kinetic properties.” J. Biol. Chem., 1970, 245(7), 1742-7; Hatfield & Umbarger “Threonine deaminase from Bacillus subtilis. I. Purification of the enzyme.” J. Biol. Chem., 1970, 245(7), 1736-41.
Smallpox (variola) DNA polymerase was chosen because (1) it is larger than most proteins, demonstrating wide applicability, and (2) current events render it of special interest. In particular, the ability to synthesize de novo a gene from a pathogenic organism permits study without actual use of the pathogen. Examples of such studies include regulation of gene expression, drug development, vaccine development, and the like. Because the pathogen is never used, there is no chance of exposure, either accidental or intentional. Moreover, the sequence of the synthetic gene may be modified to optimize expression in the selected, non-pathogenic organism.
Ty3 was chosen because of its resemblance to retroviruses. GAG3 encodes Gag3p, a 38 kDa polyprotein that is processed into a 26 kDa capsid and 9 kDa nucleocapsid that assemble into virus-like particles (VLP). Ty3 IN is implicated in the retrovirus-like integration of Ty3 in the S. cerevisiae genome.
Preferred embodiments of the disclosed method are illustrated in the following Examples. In these Examples, the terms “Medium-sized piece” and “Intermediate Fragment” are used interchangeably.
EXAMPLE 1 illustrates the synthesis of an E. coli threonine deaminase gene by a two-step hierarchical decomposition and reassembly by overlap extension. E. coli threonine deaminase is a protein with 514 amino acid residues (1,542 coding bases).
Design
The sequence design method permuted synonymous (silent) codon assignments to each amino acid in the desired protein sequence. Each synonymous codon change results in a different artificial gene sequence that encodes the same protein. Because E. coli was the desired expression host, the initial codon assignment was to pair each amino acid with its most frequent codon according to E. coli genomic codon usage statistics. Subsequently, the codon assignments were perturbed as described below. The final codon assignment implied a final DNA sequence to be achieved biochemically.
In this two-step hierarchical decomposition, the gene was divided first into five overlapping medium-sized pieces (in the present example, not longer than 340 bases, overlap not shorter than 33 bases), then each medium-sized piece was divided into several overlapping short segments (in the present example not longer than 50 bases, overlap not shorter than 18 bases). All overlaps were lengthened if necessary to include a terminal C or G for priming efficiency.
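The length and overlap limits above fix the minimum number of pieces at each level. The following is a minimal sketch of that arithmetic, assuming evenly spaced pieces of maximum length; it does not model the lengthening of overlaps to end in a C or G, nor any melting-temperature considerations, and the function names are hypothetical.

```python
import math

def min_pieces(total_len, max_len, min_overlap):
    """Smallest number of pieces of at most max_len bases that cover
    total_len bases when adjacent pieces share at least min_overlap bases."""
    if total_len <= max_len:
        return 1
    return math.ceil((total_len - min_overlap) / (max_len - min_overlap))

def split_with_overlap(total_len, max_len, min_overlap):
    """Return half-open (start, end) intervals for evenly spaced pieces of
    length max_len; adjacent pieces overlap by at least min_overlap."""
    n = min_pieces(total_len, max_len, min_overlap)
    if n == 1:
        return [(0, total_len)]
    slack = total_len - max_len  # start position of the last piece
    starts = [(i * slack) // (n - 1) for i in range(n)]
    return [(s, s + max_len) for s in starts]

if __name__ == "__main__":
    # 1,542-base gene, pieces <= 340 bases, overlaps >= 33 bases.
    print(min_pieces(1542, 340, 33))
    print(split_with_overlap(1542, 340, 33))
    # One 340-base piece, segments <= 50 bases, overlaps >= 18 bases.
    print(min_pieces(340, 50, 18))
```

With the limits of the present example, the calculation gives five medium-sized pieces for the 1,542-base gene and eleven short segments for a 340-base piece, in line with the five pieces and the sets of 11 or 12 segments used here.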
Theoretical melting temperatures were calculated with Mfold for all possible correct and incorrect hybridizations of the medium-sized and short pieces of DNA using the most common codons. The results are illustrated in
The final codon assignment to every amino acid in the threonine deaminase protein sequence is provided in
The design objectives may be achieved without excessive use of rare codons as illustrated in
The resulting DNA sequence was decomposed into the short overlapping segments shown in the overlap maps illustrated in
Synthesis
The target DNA sequence of the synthetic E. coli L-threonine deaminase gene has 1,542 bases. In the present example, the target DNA sequence was decomposed into a set of five medium-sized pieces, each medium-sized piece overlapping the adjacent medium-sized piece by at least 33 bases. In turn, each medium-sized piece was decomposed into “sets” of 11 or 12 small, single-stranded DNA segments, which overlapped the adjacent segment by from 18 to 50 bases. The five medium-sized pieces are designated “Medium-sized piece 0” through “Medium-sized piece 4” herein. The small single-stranded DNA segments that make up the medium-sized pieces are designated as “Seg,” medium-sized piece number, and segment number, where the segment number starts at 0 starting at the 5′-end of the forward segment. Note that the numbering for both the medium-sized pieces and the segments begins at zero. For example, the first segment of Medium-sized piece 0 is “Seg-0-0”; the seventh segment of Medium-sized piece 4 is “Seg-4-6.” The segments and primers were commercially synthesized by Illumina, Inc., San Diego, Calif.
Leader and Trailer Primers
For Medium-sized piece 0, the first segment, Seg-0-0, serves as the leader primer for overlap extension. The trailer primer (reverse complement) is 5′-GTAGCAGTAGGCATCAC-3′ (17-mer, SEQ. ID. NO.: 117). For Medium-sized piece 1, segment Seg-1-0 serves as the leader primer and the trailer primer is 5′-GGCTTCAACGGCTATCAC-3′ (18-mer, SEQ. ID. NO.: 118). For Medium-sized piece 2, segment Seg-2-0 serves as the leader primer and the trailer primer is 5′-GCTTAAGATGTGGGCCAG-3′ (18-mer, SEQ. ID. NO.: 119). For medium-sized piece 3, Seg-3-0 serves as the leader primer and Seg-3-11 serves as the trailer primer. For Medium-sized piece 4, segment Seg-4-0 serves as the leader primer and the trailer primer is 5′-TTAGCCTGCGAGGAAGAAAC-3′ (20-mer, SEQ. ID. NO.: 120). Those skilled in the art will appreciate that the segments may be designed such that no added primers are used, as for Medium-sized piece 3, such that one added primer is used, as for Medium-sized pieces 0-2, and 4, or such that two added primers are used, not illustrated. In particular, if an even number of segments is used, the segments may be designed such that no added primers are needed where the first segment serves as the leader primer and the last segment serves as the trailer primer. It will also be appreciated that different flanking sequences may be added easily to these leaders and trailers.
Assembly of the Five Medium-sized Pieces
First Overlap Extension Reactions. The five Medium-sized pieces were constructed in parallel by overlap extension and PCR from the appropriate set of single-stranded DNA sequences. The reaction mixture is provided in TABLE III and the thermocycler conditions in TABLE IV. The products of the overlap extension reactions were separated on a 1% agarose gel, shown in
*For each of the synthetic oligonucleotide sets (0-4), the forward and reverse complement synthetic oligonucleotides were mixed in equal amounts to a final concentration of 27.5 ng/μL.
PCR Reaction (enrichment). Each Medium-sized piece was separately enriched by PCR using the reaction mixture provided in TABLE V and the thermocycler conditions provided in TABLE VI. The products of the PCR reactions were separated on a 1% agarose gel shown in
Assembly of the Five Medium-Sized Pieces into a Full Length Gene
Second Overlap Extension Reaction. The synthetic threonine deaminase gene was constructed from the five medium-sized pieces from the PCR reactions using the reaction mixture provided in TABLE VII and the thermocycler conditions provided in TABLE VIII. After the reaction was complete, product was run on a 1.2% agarose gel (
1. Forty-five base leader primer with an Nde I restriction endonuclease site (CATATG): 5′-CTATATCTAGCATATGGCCGATTCTCAACCTCTGTCTGGAGCACC-3′ (SEQ. ID. NO.: 121).
2. Forty-five base trailer primer (reverse complement) with a BamH I restriction endonuclease site (GGATCC): 5′-GTATTGGATCCTTAGCCTGCGAGGAAGAAACGAAAGGCGGGGTTG-3′ (SEQ. ID. NO.: 122).
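The two flanking primers can be checked for length and for the presence of their restriction sites with a few lines of code. The following is a minimal sketch; the primer sequences are those of SEQ. ID. NOS.: 121 and 122, the recognition sequences CATATG (Nde I) and GGATCC (BamH I) are the standard ones, and the function names are hypothetical.

```python
LEADER = "CTATATCTAGCATATGGCCGATTCTCAACCTCTGTCTGGAGCACC"   # SEQ. ID. NO.: 121
TRAILER = "GTATTGGATCCTTAGCCTGCGAGGAAGAAACGAAAGGCGGGGTTG"  # SEQ. ID. NO.: 122
NDE_I = "CATATG"
BAMH_I = "GGATCC"

_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))

def site_positions(seq, site):
    """Return every 0-based position at which the site occurs in seq."""
    return [i for i in range(len(seq) - len(site) + 1)
            if seq[i:i + len(site)] == site]

if __name__ == "__main__":
    print(len(LEADER), site_positions(LEADER, NDE_I))
    print(len(TRAILER), site_positions(TRAILER, BAMH_I))
    # Coding-strand view of the 3' end of the gene, from the trailer primer.
    print(reverse_complement(TRAILER))
```

Both recognition sequences are palindromic, so each site reads the same on either strand. The 45-base trailer also contains the 20-base trailer primer of Medium-sized piece 4 immediately after the BamH I site.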
Ligation and Expression
Directional cloning is performed on the purified threonine deaminase gene. The pET14b expression vector (Novagen) is cleaved with BamH I and Nde I, generating termini compatible with those of the threonine deaminase insert. The insert is ligated into the vector, and the resulting construct is used to transform electrocompetent BL21(DE3) cells.
EXAMPLE 2 illustrates the synthesis of a variola DNA polymerase gene by a three-step hierarchical decomposition and reassembly by overlap extension. Variola DNA polymerase is a protein with 1,005 amino acid residues (3,015 coding bases). Variola DNA polymerase is also referred to as “Varpol” and “vpol” herein.
Design
Because the variola DNA polymerase gene was intended for expression in E. coli, codon selection considerations were similar to those used in the design of E. coli threonine deaminase described in EXAMPLE 1.
The three-step hierarchical decomposition was performed as follows. First, the gene was divided into two large pieces of about 1,500 bases. The first large piece is designated herein as “Part I” or “polymerase-1.” The second large piece is referred to herein as “Part II” or “polymerase-2.” Parts I and II were designed with complementary Apa I sites to allow their reassembly by ligation. Second, each large piece was divided into five overlapping medium-sized pieces (Intermediate Fragments), in the present example, not longer than 340 bases, with overlaps not shorter than 33 bases. Third, each medium-sized piece was divided into several overlapping short segments, in the present example not longer than 50 bases, with overlaps not shorter than 18 bases. All overlaps were lengthened if necessary to include a terminal C or G for DNA polymerase priming efficiency.
The sequences of the DNA pieces were perturbed to provide a suitable gap between the lowest-melting correct match and the highest-melting wrong match. Theoretical melting temperatures were calculated as described in EXAMPLE 1. For Part I of the gene, the melting temperatures of the correct and incorrect matches for the small and medium-sized pieces for the most common E. coli codons are provided in
Theoretical melting temperatures for the small and medium-sized pieces of Part II of the variola DNA polymerase gene are provided in
Overlap maps for the overlapping short segments used to build each of the five Intermediate Fragments (medium-sized pieces) 0-4, leaders, and trailers that make up Part I of the variola DNA polymerase gene are provided in
Overlap maps for the overlapping short segments used to build each of the five Intermediate Fragments 0-4, leaders, and trailers that make up Part II of the variola DNA polymerase gene are provided in
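The design criterion described above, a gap between the lowest-melting correct match and the highest-melting wrong match, can be expressed compactly. The following is a minimal sketch only. It substitutes the simple Wallace rule (2° C. per A or T, 4° C. per G or C) for the Mfold calculation actually used, and the melting temperatures passed to the gap function in the usage lines are hypothetical values, not data from this example.

```python
def wallace_tm(seq):
    """Crude melting-temperature estimate in degrees C using the Wallace
    rule. This is a stand-in for illustration, not the Mfold calculation."""
    s = seq.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2 * at + 4 * gc

def melting_gap(correct_tms, incorrect_tms):
    """Gap between the lowest-melting correct hybridization and the
    highest-melting incorrect one. A larger positive gap leaves a wider
    window of annealing temperatures that favor correct assembly."""
    return min(correct_tms) - max(incorrect_tms)

if __name__ == "__main__":
    # Estimate for the 17-mer trailer primer of SEQ. ID. NO.: 117.
    print(wallace_tm("GTAGCAGTAGGCATCAC"))
    # Hypothetical melting temperatures, for illustration only.
    print(melting_gap([60.0, 58.0, 62.0], [40.0, 45.0]))
```

In a design loop, codon assignments would be perturbed and the gap recomputed until it is acceptably large; the acceptance threshold is a design choice that this sketch leaves open.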
Biochemistry
The variola DNA polymerase gene was assembled in the reverse order of the design process: first, the assembly of the ten Intermediate Fragments (five each for Parts I and II of the variola DNA polymerase gene); second, the assembly of the two 1500 bp large pieces, Parts I and II (which combined make up the full length gene); and finally, Apa I digestion of Part I (in the segment 1Seg-4-09) and Part II (in the segment 2Seg-2-00) to generate compatible flanking termini, allowing the two large pieces to be ligated together to generate the full-length variola DNA polymerase gene.
Variola DNA polymerase Part I. Each of the five Intermediate Fragments 0-4 that make up Part I was assembled from synthetic oligonucleotide sets of alternating strand specificity that overlap one another. The sequences of the short ssDNA segments used to construct each of the Intermediate Fragments are provided in
For each of the Intermediate Fragments, the first segment (1Seg-0-00, 1Seg-1-00, 1Seg-2-00, 1Seg-3-00, and 1Seg-4-00) serves as the leader primer for overlap extension. The last segment (1Seg-0-11, 1Seg-1-11, 1Seg-2-09, 1Seg-3-09, and 1Seg-4-09) serves as the trailer primer.
Variola DNA polymerase Part II. Each of the five Intermediate Fragments 0-4 that make up Part II was assembled from synthetic oligonucleotide sets of alternating strand specificity that overlap one another. The sequences of the short ssDNA segments used to construct each of the Intermediate Fragments are provided in
For each of the Intermediate Fragments, the first segment (2Seg-0-00, 2Seg-1-00, 2Seg-2-00, 2Seg-3-00, and 2Seg-4-00) serves as the leader primer for overlap extension. The last segment (2Seg-0-11, 2Seg-1-11, 2Seg-2-09, 2Seg-3-09, and 2Seg-4-09) serves as the trailer primer.
Assembly of the Five Intermediate Fragments into Large Pieces Part I and Part II
Each Intermediate Fragment was separately constructed in a first overlap extension reaction from the appropriate set of ssDNA sequences provided in
*For each of the synthetic oligonucleotide sets (0-4), the forward and reverse complement synthetic oligonucleotides were mixed in equal amounts to a final concentration of 27.5 ng/μL.
The products of these reactions were separated on 1% agarose gels as shown in
Assembly of the Intermediate Fragments into Full Length Part I and Part II of the Variola DNA Polymerase Gene
Parts I and II of the variola DNA polymerase gene were assembled in a second overlap extension reaction of their constituent Intermediate Fragment sets using the reaction mixture provided in TABLE XI and the thermocycler program provided in TABLE XII.
*Taken directly from overlap extension reactions (
For the construction of Part I of the variola DNA polymerase gene, two overlap extension reactions were performed using the two different sets of primers illustrated in
For the construction of Part II of the variola DNA polymerase gene, two overlap extension reactions were performed using two different sets of primers illustrated in
The products of these reactions were separated on a 1% agarose gel, which is illustrated in
Ligation of Part I and Part II to Generate the Full Length Variola DNA Polymerase Gene (3000 bp)
The DNA from the lanes corresponding to Parts I and II of the variola DNA polymerase gene was purified using GENECLEAN® (BIO101 Systems®, Qbiogene). The two fragments were separately digested overnight with Apa I (Boehringer Mannheim), purified on a 1% agarose gel, and purified using GENECLEAN®.
Parts I and II were ligated for 1.25 hr at ambient temperature under the conditions provided in TABLE XIII. The products were quantified by fluorometry. Ligation was confirmed by PCR using the 1lead-01 and 2trail-57 primers, the product of which was isolated on a 1% agarose gel, shown in
EXAMPLE 3 illustrates the synthesis of an E. coli threonine deaminase gene by dividing the gene in a one-step hierarchical decomposition and synthesizing the gene by the direct self-assembly method.
Design
In the one-step hierarchical decomposition, the gene was divided directly into 54 overlapping short segments, in the present example not longer than about 60 bases, overlap not shorter than about 27 bases. Because the gene was designed for reassembly by ligation, the adjacent short segments on the same strand abut, i.e., with no single-stranded gaps between the double-stranded overlaps. The overlaps were not designed to terminate in a G or C.
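A layout in which same-strand segments abut can be sketched by cutting the two strands at staggered positions. The following is a minimal illustration under simplifying assumptions: the forward strand is cut evenly, the reverse strand is cut at the midpoints of the forward segments, and leaders, trailers, and end overhangs are not modeled. For that reason it yields 27 forward and 28 reverse segments rather than the 54 segments of the present example. The function names are hypothetical.

```python
def abutting_layout(total_len, n_forward):
    """Return (forward, reverse) lists of half-open (start, end) intervals.
    Segments on each strand abut with no gaps; reverse-strand junctions
    fall at the midpoints of the forward-strand segments."""
    cuts = [(i * total_len) // n_forward for i in range(n_forward + 1)]
    forward = list(zip(cuts[:-1], cuts[1:]))
    mids = [(a + b) // 2 for a, b in forward]
    r_cuts = [0] + mids + [total_len]
    reverse = list(zip(r_cuts[:-1], r_cuts[1:]))
    return forward, reverse

def duplex_overlaps(forward, reverse):
    """Lengths of every double-stranded overlap between a forward-strand
    segment and a reverse-strand segment."""
    overlaps = []
    for a, b in forward:
        for c, d in reverse:
            shared = min(b, d) - max(a, c)
            if shared > 0:
                overlaps.append(shared)
    return overlaps

if __name__ == "__main__":
    fwd, rev = abutting_layout(1542, 27)
    print(len(fwd), len(rev))
    print(max(e - s for s, e in fwd + rev))
    print(min(duplex_overlaps(fwd, rev)))
```

Because every junction on one strand lies in the interior of a segment on the other strand, each nick is bridged by double-stranded DNA on both sides, which is the condition needed for joining by ligase.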
Theoretical melting temperatures were calculated as described in EXAMPLE 1. The distribution of calculated melting temperatures of the short segments using the most common E. coli codons is provided in
The resulting DNA sequence was decomposed into the 54 short overlapping segments shown in an overlap map in
Biochemistry
The threonine deaminase gene was assembled in two steps by direct self-assembly. The sequences of the leaders, short ssDNA segments, and trailers are provided in
The threonine deaminase gene was first divided into four Medium-sized pieces of equal size, which were each reassembled by direct self-assembly and ligation, then the four Medium-sized pieces were combined and ligated into the full-length gene.
Direct Self-Assembly of the Four Medium-Sized Pieces
Each of the four Medium-sized pieces was constructed in parallel as follows.
Annealing Reaction. The short segments were first annealed in a thermocycler to form DNA constructs corresponding to the Medium-sized pieces. Each medium-sized piece was divided into two parts, a forward strand and a reverse strand. For each strand, a 20 μL solution of the short segments corresponding to each strand at a concentration of 0.825 μM was treated with 800 U of T4 polynucleotide kinase. The kinased, short segments corresponding to the forward and reverse strands of each Medium-sized piece were mixed together. A phenol extraction was performed on the mixture, followed by an ethanol precipitation and a 70% ethanol wash. The pellet was resuspended in 7.8 μL of 1×TE, from which a solution detailed in TABLE XIV was prepared. The concentration of each short segment was 1.65 μM in this solution. This solution was placed in a thermocycler, which was programmed as described in TABLE XV.
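The concentrations quoted above are mutually consistent, as a short calculation shows. The following is a minimal sketch that assumes the stated 0.825 μM applies to each individual short segment and that no material is lost during the phenol extraction and ethanol precipitation; the implied final volume is an inference from these figures, not a value taken from TABLE XIV, and the function names are hypothetical.

```python
def picomoles(volume_ul, conc_um):
    """Amount in pmol: microliters x micromolar = picomoles."""
    return volume_ul * conc_um

def implied_volume_ul(amount_pmol, conc_um):
    """Volume in microliters that gives the stated concentration."""
    return amount_pmol / conc_um

if __name__ == "__main__":
    # Each segment: 20 uL at 0.825 uM in the kinase reaction.
    amount = picomoles(20.0, 0.825)
    # Final annealing solution: each segment at 1.65 uM.
    volume = implied_volume_ul(amount, 1.65)
    print(amount, volume)
    # Volume left for other components after the 7.8 uL of 1x TE.
    print(volume - 7.8)
```

Under these assumptions each segment is present at about 16.5 pmol, and the annealing solution has a total volume of about 10 μL, of which 7.8 μL is the resuspended pellet.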
Ligation Reaction. Each Medium-sized piece was produced by ligation of the corresponding DNA construct synthesized in the annealing reaction, described above. The reaction mixture provided in TABLE XVI was maintained at 16° C. overnight. An agarose gel of the resulting four Medium-sized pieces is provided in
Assembly of the Four Medium-Sized Pieces into the Threonine Deaminase Gene
Each Medium-sized piece was isolated from the gel using GENECLEAN® (BIO101 Systems®, Qbiogene). A ligation reaction mixture containing the four Medium-sized pieces is provided in TABLE XVII. The ligation reaction was performed at 16° C. overnight. An agarose gel of the products, including the full-length threonine deaminase gene, is provided in
At this point the assembled gene may be stored at 4° C. or cloned into expression vector pET14b (Novagen) by ExoIII digestion. The threonine deaminase gene is designed to have 12 bp 3′-end overhangs that are compatible with the 5′-overhang regions of the pET14b vector after it has been treated with ExoIII for 1 minute at 14° C. The insert and vector are joined by mixing and heating, followed by cooling to a temperature below the Tm of the overlapping regions of the insert and vector. The annealed fragments are transformed into an E. coli host at 37° C.
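Compatibility of an insert overhang with a vector overhang amounts to the two single strands being reverse complements of one another. A minimal sketch of that check follows; the 12-nucleotide sequences are invented for illustration and are not the actual pET14b or threonine deaminase overhangs.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def overhangs_anneal(insert_3p_overhang, vector_5p_overhang):
    """True if the two overhangs, both written 5'->3', are fully complementary."""
    return reverse_complement(insert_3p_overhang) == vector_5p_overhang.upper()


# Hypothetical 12-nt overhangs, for illustration only.
insert_overhang = "ATGGCTAGCAAA"
vector_overhang = "TTTGCTAGCCAT"
print(overhangs_anneal(insert_overhang, vector_overhang))
```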
EXAMPLE 4 illustrates the synthesis of GAG3 by two-step recursive decomposition with sampling and sequencing. The GAG3 open reading frame (ORF) is 876 bp long.
The GAG3 ORF was divided into three overlapping intermediate fragments for reassembly by overlap extension.
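A division of this kind can be sketched as below. The text does not state the fragment boundaries or the overlap length, so the sketch assumes fragments of near-equal size and a hypothetical 30-base overlap; the function name is likewise an assumption.

```python
def overlapping_fragments(length, n, overlap):
    """Return n (start, end) spans covering [0, length).

    Neighboring spans share `overlap` bases. Sizes are made as equal as
    possible: n fragments with n - 1 shared overlaps must satisfy
    sum(sizes) = length + (n - 1) * overlap.
    """
    total = length + (n - 1) * overlap
    base, extra = divmod(total, n)
    spans, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        spans.append((start, start + size))
        # The next fragment starts `overlap` bases before this one ends.
        start += size - overlap
    return spans


# 876 bp ORF, three fragments, assumed 30-base overlaps.
print(overlapping_fragments(876, 3, 30))
```

Each fragment produced this way is short enough to be decomposed again into oligonucleotides, which is the second step of the two-step decomposition.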
In the assembly of the intermediate fragments, the oligonucleotides were mixed to a final concentration of 0.1 μM with DNA polymerase (Proofstart®, Qiagen) and appropriate leader and trailer sequences.
The intermediate fragments were each cloned using a blunt-end ligation procedure (pCR-Blunt II-TOPO® vector, Invitrogen). Four clones of each intermediate fragment were sequenced, and a clone with the correct sequence was selected. The selected sequences were amplified out of the vector by PCR (Proofstart® DNA polymerase, Qiagen).
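The sampling and sequencing step reduces to comparing each sequenced clone against the designed fragment and keeping one that matches exactly. A minimal sketch, with hypothetical clone names and toy sequences:

```python
def select_correct_clone(design, clone_reads):
    """Return the name of the first clone whose read matches the design, or None.

    `clone_reads` maps a clone name to its sequenced insert (5'->3').
    """
    target = design.upper()
    for name, read in clone_reads.items():
        if read.upper() == target:
            return name
    return None


# Toy data for illustration only; real reads would be hundreds of bases long.
design = "ATGGCTAGCAAAGGT"
reads = {
    "clone1": "ATGGCTAGCAAGGT",    # one-base deletion
    "clone2": "ATGGCTTGCAAAGGT",   # substitution
    "clone3": "ATGGCTAGCAAAGGT",   # matches the design
    "clone4": "ATGGCTAGCAAAGGT",
}
print(select_correct_clone(design, reads))
```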
The intermediate fragments were mixed and extended to full-duplex DNA as described for the oligonucleotides.
The synthetic GAG3 gene was cloned into the pET-3a plasmid (Novagen) using Nde I and BamH I endonuclease restriction sites designed in the 5′ and 3′ PCR gene primers. The resulting plasmid contained the entire GAG3 gene under the control of an inducible T7 promoter, and a bacterial ribosome-binding site (Shine-Dalgarno sequence). The BL21(DE3) pLysS strain of E. coli (Novagen) was transformed with the plasmid. Expression of the host-encoded T7 RNA polymerase was induced using isopropyl-1-thio-β-D-galactopyranoside (IPTG) at a concentration of 0.4 mM. At 30 min intervals, cells were harvested by centrifugation and sonicated.
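Cloning through Nde I and BamH I sites placed in the primers works only if the gene itself contains neither site, since an internal site would be cut along with the flanking ones. The text does not describe how this was verified; the following is a sketch of one such check, using the Nde I (CATATG) and BamH I (GGATCC) recognition sequences and invented test sequences. Both sites are palindromic, so scanning one strand is sufficient.

```python
SITES = {"NdeI": "CATATG", "BamHI": "GGATCC"}


def find_sites(seq, sites=SITES):
    """Map each enzyme name to the 0-based start positions of its site in seq."""
    seq = seq.upper()
    hits = {}
    for enzyme, motif in sites.items():
        positions = []
        i = seq.find(motif)
        while i != -1:
            positions.append(i)
            # Continue one base further so overlapping occurrences are found.
            i = seq.find(motif, i + 1)
        hits[enzyme] = positions
    return hits


# Invented sequences for illustration only.
print(find_sites("ATGAAACCCGGGTTT"))
print(find_sites("AAGGATCCTTCATATG"))
```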
EXAMPLE 5 illustrates the synthesis of the Ty3 IN gene by a two-step recursive decomposition with sampling and sequencing. The Ty3 IN gene is 1640 bp long and is illustrated in
The Ty3 IN gene was divided into ten intermediate fragments for reassembly by overlap extension. Overlap maps for the leader, ten intermediate fragments, and trailer are provided in
The ten intermediate fragments were separately assembled from the oligonucleotides, cloned, and sequenced as described in EXAMPLE 4.
An α-fragment frameshifted vector was constructed as illustrated in
The resulting frameshifted vector was transformed into E. coli JM109 (genotype: e14-(McrA-) recA1 endA1 gyrA96 thi-1 hsdR17(rK-mK+) supE44 relA1 Δ(lac-proAB) [F′ traD36 proAB lacIqZΔM15]), and the cells grown on indicator LB agar, which contains isopropylthio-β-D-galactoside (IPTG) and 5-bromo-4-chloro-3-indolyl-β-D-galactoside (X-Gal). The resulting colonies were white.
As illustrated in
Those skilled in the art will understand that the MCS in the α-fragment frameshifted vector described herein contains at least 13 distinct restriction sites, and consequently, in some embodiments, the DNA sequence is inserted into a different restriction site.
E. coli AB4141 contains a conditionally lethal, temperature-sensitive valS gene. The strain will grow at a permissive temperature of about 37° C., but not at a restrictive temperature of about 42° C. The strain will, however, grow at the restrictive temperature when transformed with a plasmid that expresses wild-type valS, for example, pDH1Δ11. Because the pCG2 plasmid does not express valS, E. coli AB4141 transformed therewith does not express wild-type valS, and consequently, does not grow at the restrictive temperature.
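The growth behavior described above can be summarized as a small truth table. The sketch models only the two conditions named in the text (permissive and restrictive) and takes the plasmid's ability to express wild-type valS as an input; the function name and labels are hypothetical.

```python
def ab4141_grows(condition, plasmid_expresses_valS):
    """Predicted growth of E. coli AB4141 as described in the text.

    condition: "permissive" (about 37 C) or "restrictive" (about 42 C).
    """
    if condition == "permissive":
        return True
    if condition == "restrictive":
        # Only a plasmid supplying wild-type valS rescues growth.
        return plasmid_expresses_valS
    raise ValueError("condition must be 'permissive' or 'restrictive'")


for condition in ("permissive", "restrictive"):
    for expresses in (False, True):
        print(condition, expresses, ab4141_grows(condition, expresses))
```

Growth at the restrictive temperature therefore reports on whether the plasmid expresses functional valS.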
A method for using the pCG2 frameshifted vector to select a correct DNA sequence synthesized as described herein is illustrated in
The embodiments illustrated and described herein are provided as examples of certain preferred embodiments. Various changes and modifications can be made to these embodiments by those skilled in the art without departing from the teachings provided herein.
This application is a continuation-in-part of U.S. application Ser. No. 10/851,383, filed May 21, 2004, which claims the benefit of U.S. Provisional Patent Application No. 60/472,822, filed on May 22, 2003, the disclosures of which are incorporated by reference.
Number | Date | Country
---|---|---
60472822 | May 2003 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 10851383 | May 2004 | US
Child | 10903632 | Jul 2004 | US