The invention is drawn to high throughput methods of gene discovery.
Given their diversity and abundance, microbial genomes represent an expansive untapped source for new gene discovery. Despite a relative lack of exploration, several gene families of agricultural and biomedical interest have been discovered in microbes and include genes that confer resistance to herbicides and pests in plants, as well as genes for antibiotic biosynthesis and antibiotic resistance. Current methods for new gene discovery from microbial genomes rely on screening isolated strains for activity in a bioassay and characterization of genes of interest by sequencing. However, complex samples containing mixed cultures of organisms often contain species that cannot be cultured or present other obstacles to performing traditional methods of gene discovery. Thus, a high throughput method of new gene identification where up to millions of culturable and non-culturable microbes can be queried simultaneously would be advantageous for identifying new genes or improved variants of known genes.
Compositions and methods for isolating new variants of known gene sequences are provided. The methods find use in identifying variants, particularly homologs in complex mixtures. Compositions comprise hybridization baits that hybridize to gene families of interest, particularly agricultural interest, in order to selectively enrich the polynucleotides of interest from complex mixtures. Bait sequences may be specific for a number of genes from distinct gene families of interest and may be designed to cover each gene of interest by at least 2-fold.
Thus methods disclosed herein are drawn to an oligonucleotide hybridization gene capture approach for identification of new genes of interest from environmental samples. This approach bypasses the need for labor-intensive microbial strain isolation, permits simultaneous discovery of genes from multiple gene families of interest, and increases the potential to discover genes from low-abundance and unculturable organisms present in complex mixtures of environmental microbes.
Methods for identifying variants of known gene sequences from complex mixtures are provided. The methods use labeled hybridization baits or bait sequences that correspond to a portion of known gene sequences to capture similar sequences from complex environmental samples. Once the DNA sequence is captured, subsequent sequencing and analysis can identify variants of the known gene sequences in a high throughput manner.
The methods of the invention are capable of identifying and isolating gene sequences, and variants thereof, from a complex sample. By “complex sample” is intended any sample having DNA from more than one species of organism. In specific embodiments, the complex sample is an environmental sample, a biological sample, or a metagenomic sample. As used herein, the term “metagenome” or “metagenomic” refers to the collective genomes of all microorganisms present in a given habitat (Handelsman et al., (1998) Chem. Biol. 5: R245-R249; Microbial Metagenomics, Metatranscriptomics, and Metaproteomics. Methods in Enzymology vol. 531 DeLong, ed. (2013)). Environmental samples can be from soil, rivers, ponds, lakes, industrial wastewater, seawater, forests, agricultural lands on which crops are growing or have grown, samples of plants or animals or other organisms associated with microorganisms that may be present within or without the tissues of the plant or animal or other organism, or any other source having biodiversity. Complex samples also include colonies or cultures of microorganisms that are grown, collected in bulk, and pooled for storage and DNA preparation. For example, colonies can be grown on plates, in bottles, in other bulk containers and collected. In certain embodiments, complex samples are selected based on expected biodiversity that will allow for identification of gene sequences, and variants thereof.
The method disclosed herein does not require purified samples of single organisms but rather is able to identify homologous sequences directly from uncharacterized mixes of populations of prokaryotic or eukaryotic organisms: from soil, from crude samples, and samples that are collected and/or mixed and not subjected to any purification. In this manner, the methods described herein can identify gene sequences, and variants thereof, from unculturable organisms, or those organisms that are difficult to culture.
New gene sequences of interest, variants thereof, and variants of known gene sequences can be identified using the methods disclosed herein. As used herein, a “gene sequence of interest,” “target sequence,” or “target sequences” is intended to refer to a known gene sequence. Known genes of interest include cry genes (Hofte and Whiteley (1989) Microbiol. Rev. 53(2):242-255; U.S. Pat. Nos. 8,609,936 and 8,609,937); cyt genes (or other hemolytic toxin or pest control genes, such as those listed in U.S. Pat. No. 8,067,671); mtx (or other mosquitocidal) genes; binary toxins (such as those listed in U.S. Pat. No. 7,655,838); VIPs (or other vegetative insecticidal proteins, such as those listed in U.S. Pat. No. 8,344,307); SIPs (or other soluble insecticidal proteins); herbicide resistance genes such as EPSPS; HPPD; 16S rRNA sequences; and housekeeping genes. In particular embodiments, the gene of interest is of agricultural importance, such as genes that confer resistance to diseases and pests, and/or tolerance to herbicides in plants. Genes of interest can also be of biological, industrial, or medical interest, such as genes for antibiotic biosynthesis and antibiotic resistance, or biosynthesis of enzymes or other factors involved in bioremediation, bioconversion, industrial processes, detoxification, biofuel production, or compounds having cytotoxic, immune system priming or other therapeutic activity. Table 1 provides examples of gene sequences that can be used in the methods and compositions disclosed herein. The sequences and references provided herein are incorporated by reference. It is important to note that these sequences are provided merely as examples; any sequences can be used in the practice of the methods and compositions disclosed herein.
The methods disclosed herein can identify variants of known sequences from multiple gene families of interest. As used herein, the term variants can refer to homologs, orthologs, and paralogs. While the activity of a variant may be altered compared to the gene of interest, the variant should retain the functionality of the gene of interest. For example, a variant may have increased activity, decreased activity, different spectrum of activity (e.g. for an insecticidal toxin gene) or any other alteration in activity when compared to the gene of interest.
In general, “variants” is intended to mean substantially similar sequences. For polynucleotides, a variant comprises a deletion and/or addition of one or more nucleotides at one or more internal sites within the native polynucleotide and/or a substitution of one or more nucleotides at one or more sites in the native polynucleotide. As used herein, a “native” or “wild type” polynucleotide or polypeptide comprises a naturally occurring nucleotide sequence or amino acid sequence, respectively. For polynucleotides, conservative variants include those sequences that, because of the degeneracy of the genetic code, encode the native amino acid sequence of the gene of interest. Naturally occurring allelic variants such as these can be identified with the use of well-known molecular biology techniques, as, for example, with polymerase chain reaction (PCR) and hybridization techniques as outlined below. Variant polynucleotides also include synthetically derived polynucleotides, such as those generated, for example, by using site-directed mutagenesis but which still encode the polypeptide of the gene of interest. Generally, variants of a particular polynucleotide disclosed herein will have at least about 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or more sequence identity to that particular polynucleotide (e.g., a gene of interest) as determined by sequence alignment programs and parameters described elsewhere herein.
Variants of a particular polynucleotide disclosed herein (i.e., the reference polynucleotide) can also be evaluated by comparison of the percent sequence identity between the polypeptide encoded by a variant polynucleotide and the polypeptide encoded by the reference polynucleotide. Percent sequence identity between any two polypeptides can be calculated using sequence alignment programs and parameters described elsewhere herein. Where any given pair of polynucleotides disclosed herein is evaluated by comparison of the percent sequence identity shared by the two polypeptides they encode, the percent sequence identity between the two encoded polypeptides is at least about 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or more sequence identity.
A. Sequence Analysis
As used herein, “sequence identity” or “identity” in the context of two polynucleotides or polypeptide sequences makes reference to the residues in the two sequences that are the same when aligned for maximum correspondence over a specified comparison window. When percentage of sequence identity is used in reference to proteins it is recognized that residue positions which are not identical often differ by conservative amino acid substitutions, where amino acid residues are substituted for other amino acid residues with similar chemical properties (e.g., charge or hydrophobicity) and therefore do not change the functional properties of the molecule. When sequences differ in conservative substitutions, the percent sequence identity may be adjusted upwards to correct for the conservative nature of the substitution. Sequences that differ by such conservative substitutions are said to have “sequence similarity” or “similarity”. Means for making this adjustment are well known to those of skill in the art. Typically this involves scoring a conservative substitution as a partial rather than a full mismatch, thereby increasing the percentage sequence identity. Thus, for example, where an identical amino acid is given a score of 1 and a non-conservative substitution is given a score of zero, a conservative substitution is given a score between zero and 1. The scoring of conservative substitutions is calculated, e.g., as implemented in the program PC/GENE (Intelligenetics, Mountain View, Calif.).
As used herein, “percentage of sequence identity” means the value determined by comparing two optimally aligned sequences over a comparison window, wherein the portion of the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. The percentage is calculated by determining the number of positions at which the identical nucleic acid base or amino acid residue occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison, and multiplying the result by 100 to yield the percentage of sequence identity.
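The calculation just described can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the two sequences have already been optimally aligned by some other program, that gaps are written as "-", and that the comparison window is the full aligned length. The function name `percent_identity` is hypothetical and is not part of GAP or of any program named herein.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity of two pre-aligned sequences over the whole window.

    Matched positions are those where both sequences carry the same base or
    residue; gapped positions still count toward the window length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("comparison window is empty")
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != gap
    )
    # matched positions / total positions in the window, times 100
    return 100.0 * matches / len(aligned_a)
```

For instance, an eight-position window with seven matched positions and one gapped position yields 87.5 percent identity under this definition.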
Unless otherwise stated, sequence identity/similarity values provided herein refer to the value obtained using GAP Version 10 using the following parameters: % identity and % similarity for a nucleotide sequence using GAP Weight of 50 and Length Weight of 3, and the nwsgapdna.cmp scoring matrix; % identity and % similarity for an amino acid sequence using GAP Weight of 8 and Length Weight of 2, and the BLOSUM62 scoring matrix; or any equivalent program thereof. By “equivalent program” is intended any sequence comparison program that, for any two sequences in question, generates an alignment having identical nucleotide or amino acid residue matches and an identical percent sequence identity when compared to the corresponding alignment generated by GAP Version 10.
The use of the term “polynucleotide” is not intended to limit the present disclosure to polynucleotides comprising DNA. Those of ordinary skill in the art will recognize that polynucleotides can comprise ribonucleotides (RNA) and combinations of ribonucleotides and deoxyribonucleotides. Such deoxyribonucleotides and ribonucleotides include both naturally occurring molecules and synthetic analogues. The polynucleotides disclosed herein also encompass all forms of sequences including, but not limited to, single-stranded forms, double-stranded forms, hairpins, stem-and-loop structures, and the like.
The methods and compositions described herein employ bait sequences to capture genes of interest, or variants thereof, from complex samples. As used herein a “bait sequence” or “bait” refers to a polynucleotide designed to hybridize to a gene of interest, or variant thereof. In specific embodiments bait sequences are single-stranded RNA sequences capable of hybridizing to a fragment of the gene of interest. For example, the RNA bait sequence can be complementary to the DNA sequence of a fragment of the gene sequence of interest. In some embodiments, the bait sequence is capable of hybridizing to a fragment of the gene of interest that is at least 50, at least 70, at least 90, at least 100, at least 110, at least 120, at least 130, at least 140, at least 150, at least 170, at least 200, at least 250, at least 400, at least 1000 contiguous nucleotides, and up to the full-length polynucleotide sequence of the gene of interest. The baits can be contiguous or sequential RNA or DNA sequences. In one embodiment, bait sequences are RNA sequences. RNA sequences cannot self-anneal and work to drive the hybridization.
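By way of illustration, the relationship between a DNA fragment of a gene of interest and an RNA bait complementary to it can be sketched as follows. This is a minimal sketch that assumes the fragment is given 5′ to 3′ using only the unambiguous bases A, C, G, and T; the function name `rna_bait_for` is hypothetical.

```python
# Watson-Crick pairing from a DNA base to the RNA base that pairs with it.
_DNA_TO_RNA_COMPLEMENT = {"A": "U", "C": "G", "G": "C", "T": "A"}


def rna_bait_for(dna_fragment: str) -> str:
    """Return the RNA sequence (5'->3') complementary to a DNA strand (5'->3')."""
    fragment = dna_fragment.upper()
    unknown = set(fragment) - set(_DNA_TO_RNA_COMPLEMENT)
    if unknown:
        raise ValueError(f"unsupported bases: {sorted(unknown)}")
    # Reverse the strand, then complement each base, so the result reads 5'->3'.
    return "".join(_DNA_TO_RNA_COMPLEMENT[base] for base in reversed(fragment))
```

Under these assumptions, the fragment 5′-ATGC-3′ corresponds to the bait 5′-GCAU-3′.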
In specific embodiments, baits are at least 50, at least 70, at least 90, at least 100, at least 110, at least 120, at least 130, at least 140, at least 150, at least 170, at least 200, or at least 250 contiguous nucleotides. For example, the bait sequence can be 50-200 nt, 70-150 nt, 100-140 nt, or 110-130 nt in length. The baits can be labeled with any detectable label in order to detect and/or capture the first hybridization complex comprised of a bait sequence hybridized to a fragment of the gene of interest, or variant thereof. In certain embodiments, the bait sequences are labeled with biotin, a hapten, or an affinity tag or the bait sequences are generated using biotinylated primers, e.g., where the baits are generated by nick-translation labeling of purified target organism DNA with biotinylated deoxynucleotides. In cases where the bait sequences are biotinylated, the target DNA can be captured using a binding partner, such as a streptavidin molecule, attached to a solid phase. In specific embodiments, the baits are biotinylated RNA baits of about 120 nt in length. The baits may include adapter oligonucleotides suitable for PCR amplification, sequencing, or RNA transcription. The baits may include an RNA promoter or are RNA molecules prepared from DNA containing an RNA promoter (e.g., a T7 RNA promoter). Alternatively, antibodies specific for the RNA-DNA hybrid can be used (see, for example, WO2013164319 A1). In some embodiments, baits can be designed to 16S DNA sequences, or any other phylogenetically differential sequence, in order to capture sufficient portions of the 16S DNA to estimate the distribution of bacterial genera present in the sample.
The bait sequences span substantially the entire sequence of the known gene. In some embodiments, the bait sequences are overlapping bait sequences. As used herein, “overlapping bait sequences” or “overlapping” refers to fragments of the gene of interest that are represented in more than one bait sequence. For example, any given 120 nt segment of a gene of interest can be represented by a bait sequence having a region complementary to nucleotides 1-60 of the fragment, another bait sequence having a region complementary to nucleotides 61-120 of the fragment, and a third bait sequence complementary to nucleotides 1-120. In some embodiments, at least 10, at least 30, at least 60, at least 90, or at least 120 nucleotides of each overlapping bait overlap with at least one other overlapping bait. In this manner, each nucleotide of a given gene of interest can be represented in at least 2 baits, which is referred to herein as being covered by at least 2× tiling. Accordingly the method described herein can use baits or labeled baits described herein that cover any gene of interest by at least 2× or at least 3× tiling.
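One simple way to obtain overlapping baits of this kind is to step along the gene by the bait length divided by the desired tiling depth. The sketch below is illustrative only and is not the output of any bait design software named herein; the function names `tile_baits` and `coverage_depth` are hypothetical. With this simple scheme the terminal nucleotides of the gene fall below the requested depth unless additional end baits are supplied, and how the ends are handled is an assumption of the sketch.

```python
def tile_baits(gene: str, bait_len: int = 120, fold: int = 2) -> list[tuple[int, str]]:
    """Return (start, sequence) pairs for overlapping baits along a gene.

    Consecutive baits are offset by bait_len // fold, so interior positions
    are represented in `fold` baits (e.g., 120 nt baits every 60 nt for 2x).
    """
    if fold < 1 or bait_len // fold < 1:
        raise ValueError("fold must be between 1 and bait_len")
    if len(gene) < bait_len:
        raise ValueError("gene is shorter than one bait")
    step = bait_len // fold
    last_start = len(gene) - bait_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # extra bait so the 3' end is not left uncovered
    return [(s, gene[s:s + bait_len]) for s in starts]


def coverage_depth(gene_len: int, baits: list[tuple[int, str]]) -> list[int]:
    """Count, for every nucleotide position, how many baits represent it."""
    depth = [0] * gene_len
    for start, seq in baits:
        for pos in range(start, start + len(seq)):
            depth[pos] += 1
    return depth
```

Calling `coverage_depth` on the result makes it straightforward to check whether a given bait set actually reaches 2× or 3× tiling at each position before the baits are synthesized.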
Baits for multiple genes can be used concurrently to hybridize with sample DNA prepared from a complex mixture. For example, if a given complex sample is to be screened for variants of multiple genes of interest, baits designed to each gene of interest can be combined in a bait pool prior to, or at the time of, mixing with prepared sample DNA. Accordingly, as used herein, a “bait pool” or “bait pools” refers to a mixture of baits designed to be specific for different fragments of an individual gene of interest and/or a mixture of baits designed to be specific for different genes of interest. “Distinct baits” refers to baits that are designed to be specific for different, or distinct, fragments of genes of interest.
Accordingly, in some embodiments, a method for preparing an RNA bait pool for the identification of genes of interest is provided. A given RNA bait pool can be specific for at least 1, at least 2, at least 10, at least 50, at least 100, at least 150, at least 200, at least 250, at least 300, at least 500, at least 750, at least 800, at least 900, at least 1,000, at least 1,500, at least 3,000, at least 5,000, at least 10,000, at least 15,000, at least 20,000, at least 30,000, at least 40,000, at least 50,000, at least 55,000, at least 60,000, or any other number of genes of interest. As used herein, a bait that is specific for a gene of interest is designed to hybridize to the gene of interest. A bait can be specific for more than one gene of interest or variants of a gene of interest. In specific embodiments, the sequences of the baits are designed to correspond to genes of interest using software tools such as Nimble Design (NimbleGen; Roche).
Methods of the invention include preparation of bait sequences, preparation of complex mixture libraries, hybridization selection, sequencing, and analysis. Such methods are set forth in the experimental section in more detail. Additionally, see NucleoSpin® Soil User Manual, Rev. 03; U.S. Publication No. 20130230857; Gnirke et al. (2009) Nature Biotechnology 27:182-189; SureSelectXT Target Enrichment System for Illumina Paired-End Sequencing Library Protocol, Version 1.6; NimbleGen SeqCap EZ Library SR User's Guide, Version 4.3; and NimbleGen SeqCap EZ Library LR User's Guide, Version 2.0, all of which are herein incorporated by reference.
Methods of preparing complex samples include fractionation and extraction of environmental samples comprising soil, rivers, ponds, lakes, industrial wastewater, seawater, forests, agricultural lands on which crops are growing or have grown, or any other source having biodiversity. Fractionation can include filtration and/or centrifugation to preferentially isolate microorganisms. In some embodiments, complex samples are selected based on expected biodiversity that will allow for identification of gene sequences, and variants thereof. Further methods of preparing complex samples include colonies or cultures of microorganisms that are grown, collected in bulk, and pooled for storage and DNA preparation. In certain embodiments, complex samples are subjected to heat treatment or pasteurization to enrich for microbial spores that are resistant to heating. In some embodiments, the colonies or cultures are grown in media that enrich for specific types of microbes or microbes having specific structural or functional properties, such as cell wall composition, resistance to an antibiotic or other compound, or ability to grow on a specific nutrient mix or specific compound as a source of an essential element, such as carbon, nitrogen, phosphorus, or potassium.
In order to provide sample DNA for hybridization to baits as described elsewhere herein, the sample DNA must be prepared for hybridization. Preparing DNA from a complex sample for hybridization refers to any process wherein DNA from the sample is extracted and reduced in size sufficient for hybridization, herein referred to as fragmentation. For example, DNA can be extracted from any complex sample directly, or by isolating individual organisms from the complex sample prior to DNA isolation. In some embodiments, sample DNA is isolated from a pure culture or a mixed culture of microorganisms. DNA can be isolated by any method commonly known in the art for isolation of DNA from environmental or biological samples (see, e.g. Schneegurt et al. (2003) Current Issues in Molecular Biology 5:1-8; Zhou et al. (1996) Applied and Environmental Microbiology 62:316-322), including, but not limited to, the NucleoSpin Soil genomic DNA preparation kit (Macherey-Nagel GmbH & Co., distributed in the US by Clontech). In one embodiment, extracted DNA can be enriched for any desired source of sample DNA. For example, extracted DNA can be enriched for prokaryotic DNA by amplification. As used herein, the term “enrich” or “enriched” refers to the process of increasing the concentration of a specific target DNA population. For example, DNA can be enriched by amplification, such as by PCR, such that the target DNA population is increased about 1.5-fold, about 2-fold, about 3-fold, about 5-fold, about 10-fold, about 15-fold, about 30-fold, about 50-fold, or about 100-fold. In certain embodiments, sample DNA is enriched by using 16S amplification.
In some embodiments, after DNA is extracted from a complex sample, the extracted DNA is prepared for hybridization by fragmentation (e.g., by shearing) and/or end-labeling. End-labeling can use any end labels that are suitable for indexing, sequencing, or PCR amplification of the DNA. The fragmented sample DNA may be about 100-1000, 100-500, 125-400, 150-300, 200-2000, 100-3000, at least 100, at least 150, at least 200, at least 250, at least 300, or about 350 nucleotides in length. The detectable label may be, for example, biotin, a hapten, or an affinity tag. Thus, in certain embodiments, sample DNA is sheared and the ends of the sheared DNA fragments are repaired to yield blunt-ended fragments with 5′-phosphorylated ends. Sample DNA can further have a 3′-dA overhang prior to ligation to indexing-specific adaptors. Such ligated DNA can be purified and amplified using PCR in order to yield the prepared sample DNA for hybridization. In other embodiments, the sample DNA is prepared for hybridization by shearing, adaptor ligation, amplification, and purification.
In some embodiments, RNA is prepared from complex samples. RNA isolated from complex samples contains genes expressed by the organisms or groups of organisms in a particular environment, which can have relevance to the physiological state of the organism(s) in that environment, and can provide information about what biochemical pathways are active in the particular environment (e.g. Booijink et al. 2010. Applied and Environmental Microbiology 76: 5533-5540). RNA so prepared can be reverse-transcribed into DNA for hybridization, amplification, and sequence analysis.
Baits can be mixed with prepared sample DNA prior to hybridization by any means known in the art. The amount of baits added to the sample DNA should be sufficient to bind fragments of a gene of interest, or variant thereof. In some embodiments, a greater amount of baits is added to the mixture compared to the amount of sample DNA. The ratio of bait to sample DNA for hybridization can be about 1:4, about 1:3, about 1:2, about 1:1.8, about 1:1.6, about 1:1.4, about 1:1.2, about 1:1, about 2:1, about 3:1, about 4:1, about 5:1, about 10:1, about 20:1, about 50:1, about 100:1, or higher.
While hybridization conditions may vary, hybridization of such bait sequences may be carried out under stringent conditions. By “stringent conditions” or “stringent hybridization conditions” is intended conditions under which the bait will hybridize to its target sequence to a detectably greater degree than to other sequences (e.g., at least 2-fold over background). Stringent conditions are sequence-dependent and will be different in different circumstances. By controlling the stringency of the hybridization and/or washing conditions, target sequences that are 100% complementary to the bait can be identified (homologous probing). Alternatively, stringency conditions can be adjusted to allow some mismatching in sequences so that lower degrees of similarity are detected (heterologous probing). In specific embodiments, the prepared sample DNA is hybridized to the baits for 16-24 hours at about 45° C., about 50° C., about 55° C., about 60° C., or about 65° C.
Typically, stringent conditions will be those in which the salt concentration is less than about 1.5 M Na ion, typically about 0.01 to 1.0 M Na ion concentration (or other salts) at pH 7.0 to 8.3, and the temperature is at least about 30° C. for short baits (e.g., 10 to 50 nucleotides) and at least about 60° C. for long baits (e.g., greater than 50 nucleotides). Stringent conditions may also be achieved with the addition of destabilizing agents such as formamide. Exemplary low stringency conditions include hybridization with a buffer solution of 30 to 35% formamide, 1 M NaCl, 1% SDS (sodium dodecyl sulfate) at 37° C., and a wash in 1× to 2×SSC (20×SSC=3.0 M NaCl/0.3 M trisodium citrate) at 50 to 55° C. Exemplary moderate stringency conditions include hybridization in 40 to 45% formamide, 1.0 M NaCl, 1% SDS at 37° C., and a wash in 0.5× to 1×SSC at 55 to 60° C. Exemplary high stringency conditions include hybridization in 50% formamide, 1 M NaCl, 1% SDS at 37° C., and a wash in 0.1×SSC at 60 to 65° C. Other exemplary high-stringency conditions are those found in SureSelectXT Target Enrichment System for Illumina Paired-End Sequencing Library Protocol, Version 1.6 and NimbleGen SeqCap EZ Library SR User's Guide, Version 4.3. Optionally, wash buffers may comprise about 0.1% to about 1% SDS. Duration of hybridization is generally less than about 24 hours, usually about 4 to about 12 hours. The duration of the wash time will be at least a length of time sufficient to reach equilibrium.
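Because the wash conditions above are stated as multiples of SSC, it can be helpful to convert an SSC strength into an approximate sodium ion concentration. The sketch below uses the 20×SSC composition given above and assumes, for illustration only, that the salts dissociate completely so that each trisodium citrate contributes three sodium ions. The function name `ssc_sodium_molarity` is hypothetical.

```python
# Composition of 20x SSC as stated in the text.
NACL_MOLAR_AT_20X = 3.0
TRISODIUM_CITRATE_MOLAR_AT_20X = 0.3


def ssc_sodium_molarity(x_strength: float) -> float:
    """Approximate Na+ molarity of an SSC solution of the given strength.

    Assumes complete dissociation: one Na+ per NaCl and three Na+ per
    trisodium citrate, scaled linearly from the 20x stock.
    """
    if x_strength <= 0:
        raise ValueError("SSC strength must be positive")
    sodium_at_20x = NACL_MOLAR_AT_20X + 3 * TRISODIUM_CITRATE_MOLAR_AT_20X
    return sodium_at_20x * x_strength / 20.0
```

Under that assumption, 1×SSC corresponds to roughly 0.195 M sodium ion and 0.1×SSC to roughly 0.0195 M.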
Specificity is typically the function of post-hybridization washes, the critical factors being the ionic strength and temperature of the final wash solution. For DNA-DNA hybrids, the Tm can be approximated from the equation of Meinkoth and Wahl (1984) Anal. Biochem. 138:267-284: Tm=81.5° C.+16.6 (log M)+0.41 (% GC)−0.61 (% form)−500/L; where M is the molarity of monovalent cations, % GC is the percentage of guanosine and cytosine nucleotides in the DNA, % form is the percentage of formamide in the hybridization solution, and L is the length of the hybrid in base pairs. The Tm is the temperature (under defined ionic strength and pH) at which 50% of a complementary target sequence hybridizes to a perfectly matched bait. Tm is reduced by about 1° C. for each 1% of mismatching; thus, Tm, hybridization, and/or wash conditions can be adjusted to hybridize to sequences of the desired identity. For example, if sequences with >90% identity are sought, the Tm can be decreased 10° C. Generally, stringent conditions are selected to be about 5° C. lower than the thermal melting point (Tm) for the specific sequence and its complement at a defined ionic strength and pH. However, severely stringent conditions can utilize a hybridization and/or wash at 1, 2, 3, or 4° C. lower than the thermal melting point (Tm); moderately stringent conditions can utilize a hybridization and/or wash at 6, 7, 8, 9, or 10° C. lower than the thermal melting point (Tm); low stringency conditions can utilize a hybridization and/or wash at 11, 12, 13, 14, 15, or 20° C. lower than the thermal melting point (Tm). Using the equation, hybridization and wash compositions, and desired Tm, those of ordinary skill will understand that variations in the stringency of hybridization and/or wash solutions are inherently described. If the desired degree of mismatching results in a Tm of less than 45° C. (aqueous solution) or 32° C. (formamide solution), it is optimal to increase the SSC concentration so that a higher temperature can be used. An extensive guide to the hybridization of nucleic acids is found in Tijssen (1993) Laboratory Techniques in Biochemistry and Molecular Biology—Hybridization with Nucleic Acid Probes, Part I, Chapter 2 (Elsevier, New York); and Ausubel et al., eds. (1995) Current Protocols in Molecular Biology, Chapter 2 (Greene Publishing and Wiley-Interscience, New York). See Sambrook et al. (1989) Molecular Cloning: A Laboratory Manual (2d ed., Cold Spring Harbor Laboratory Press, Plainview, N.Y.).
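The Tm approximation of Meinkoth and Wahl and the mismatch adjustment described above can be written out as a short calculation. The sketch below is illustrative only: it assumes the logarithm in the equation is base 10, that the hybrid length L equals the length of the supplied sequence, and that the equation is applied as stated for DNA-DNA hybrids (RNA-DNA hybrids may behave differently). The function names `approximate_tm_c` and `tm_for_identity` are hypothetical.

```python
import math


def approximate_tm_c(seq: str, monovalent_molar: float, pct_formamide: float = 0.0) -> float:
    """Tm = 81.5 + 16.6*log10(M) + 0.41*(%GC) - 0.61*(%form) - 500/L, in deg C."""
    if not seq:
        raise ValueError("sequence is empty")
    if monovalent_molar <= 0:
        raise ValueError("cation molarity must be positive")
    length = len(seq)
    upper = seq.upper()
    pct_gc = 100.0 * (upper.count("G") + upper.count("C")) / length
    return (
        81.5
        + 16.6 * math.log10(monovalent_molar)
        + 0.41 * pct_gc
        - 0.61 * pct_formamide
        - 500.0 / length
    )


def tm_for_identity(perfect_match_tm_c: float, pct_identity: float) -> float:
    """Lower Tm by about 1 deg C for each 1% of mismatching."""
    if not 0.0 <= pct_identity <= 100.0:
        raise ValueError("percent identity must be between 0 and 100")
    return perfect_match_tm_c - (100.0 - pct_identity)
```

As a worked case, a 100 base pair hybrid with 50% GC content in 1 M monovalent cation and no formamide gives 81.5 + 0 + 20.5 − 0 − 5 = 97° C., and seeking sequences of 90% identity lowers that figure by about 10° C. to about 87° C.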
As used herein, a hybridization complex refers to sample DNA fragments hybridizing to a bait. Following hybridization, the labeled baits can be separated based on the presence of the detectable label, and the unbound sequences are removed under appropriate wash conditions that remove the nonspecifically bound DNA and unbound DNA, but do not substantially remove the DNA that hybridizes specifically. The hybridization complex can be captured and purified from non-binding baits and sample DNA fragments. For example, the hybridization complex can be captured by using a streptavidin molecule attached to a solid phase, such as a bead or a magnetic bead. In such embodiments, the hybridization complex captured onto the streptavidin coated bead can be selected by magnetic bead selection. The captured sample DNA fragment can then be amplified and index tagged for multiplex sequencing. As used herein, “index tagging” refers to the addition of a known polynucleotide sequence in order to track the sequence or provide a template for PCR. Index tagging the captured sample DNA sequences can identify the DNA source in the case that multiple pools of captured and indexed DNA are sequenced together. As used herein, an “enrichment kit” or “enrichment kit for multiplex sequencing” refers to a kit designed with reagents and instructions for preparing DNA from a complex sample and hybridizing the prepared DNA with labeled baits. In certain embodiments, the enrichment kit further provides reagents and instructions for capture and purification of the hybridization complex and/or amplification of any captured fragments of the genes of interest. In specific embodiments, the enrichment kit is the SureSelectXT Target Enrichment System for Illumina Paired-End Sequencing Library Protocol, Version 1.6. In other specific embodiments, the enrichment kit is as described in the NimbleGen SeqCap EZ Library SR User's Guide, Version 4.3.
Alternatively, the DNA from multiple complex samples can be indexed and amplified before hybridization. In such embodiments, the enrichment kit can be the SureSelectXT2 Target Enrichment System for Illumina Multiplexed Sequencing Protocol, Version D.0.
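The way an index tag identifies the DNA source when several pools are sequenced together can be illustrated with a small sketch. It assumes, purely for illustration, that each read begins with a fixed-length index, that indexes are matched exactly, and that the index sequences and pool names shown in any usage are made up; commercial sequencing platforms typically read the index separately and handle this step in their own software. The function name `demultiplex` is hypothetical.

```python
def demultiplex(
    reads: list[str],
    index_to_pool: dict[str, str],
    index_len: int,
) -> tuple[dict[str, list[str]], list[str]]:
    """Assign reads to their source pool using a leading index tag.

    Returns (pools, unassigned): pools maps each pool name to its reads with
    the index removed; unassigned holds reads whose index is not recognized.
    """
    pools: dict[str, list[str]] = {pool: [] for pool in index_to_pool.values()}
    unassigned: list[str] = []
    for read in reads:
        tag, insert = read[:index_len], read[index_len:]
        pool = index_to_pool.get(tag)
        if pool is None:
            unassigned.append(read)  # keep the full read for inspection
        else:
            pools[pool].append(insert)
    return pools, unassigned
```

Reads whose tag is not recognized are kept apart rather than discarded, so that index errors can be examined separately.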
Following hybridization, the captured target organism DNA can be sequenced by any means known in the art. Sequencing of nucleic acids isolated by the methods described herein is, in certain embodiments, carried out using massively parallel short-read sequencing systems such as those provided by Illumina®, Inc. (HiSeq 1000, HiSeq 2000, HiSeq 2500, Genome Analyzers, MiSeq systems) or Applied Biosystems™ Life Technologies (ABI PRISM® Sequence detection systems, SOLiD™ System, Ion PGM™ Sequencer, Ion Proton™ Sequencer), because the read out generates more bases of sequence per sequencing unit than other sequencing methods that generate fewer but longer reads. Sequencing can also be carried out by methods generating longer reads, such as those provided by Oxford Nanopore Technologies® (GridION, MinION) or Pacific Biosciences (PacBio RS II). Sequencing can also be carried out by standard Sanger dideoxy terminator sequencing methods and devices, or on other sequencing instruments, such as those described in, for example, United States patents and patent applications U.S. Pat. Nos. 5,888,737, 6,175,002, 5,695,934, 6,140,489, 5,863,722, 2007/007991, 2009/0247414, 2010/0111768 and PCT application WO2007/123744, each of which is incorporated herein by reference in its entirety.
Sequences can be assembled by any means known in the art. The sequences of individual fragments of genes of interest can be assembled to identify the full length sequence of the gene of interest, or variant thereof. In some embodiments, sequences are assembled using the CLC Bio suite of bioinformatics tools. Following assembly, sequences of genes of interest, or variants thereof, are searched (e.g., sequence similarity search) against a database of known sequences including those of the genes of interest in order to identify the gene of interest, or variant thereof. In this manner, new variants (i.e., homologs) of genes of interest can be identified from complex samples.
Kits are provided for identifying genes of interest, or variants thereof, by the methods disclosed herein. The kits include a bait pool or RNA bait pool, or reagents suitable for producing a bait pool specific for a gene of interest, along with other reagents, such as a solid phase containing a binding partner of any detectable label on the baits. In specific embodiments, the detectable label is biotin and the binding partner is streptavidin or streptavidin adhered to magnetic beads. The kits may also include solutions for hybridization, washing, or eluting of the DNA/solid phase compositions described herein, or may include a concentrate of such solutions.
(Table fragment; only organism names survive: Paenibacillus popilliae; P. lentimorbus semadara; P. popilliae popilliae.)
The articles “a” and “an” are used herein to refer to one or more than one (i.e., to at least one) of the grammatical object of the article. By way of example, “an element” means one or more elements.
All publications and patent applications mentioned in the specification are indicative of the level of those skilled in the art to which this invention pertains. All publications and patent applications are herein incorporated by reference to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference.
Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, it will be obvious that certain changes and modifications may be practiced within the scope of the appended claims.
The following examples are offered by way of illustration and not by way of limitation.
Sampling and DNA preparation: Soil samples were collected from diverse environmental niches on private property in NC. Genomic DNA was prepared from 400 mg of each sample with the NucleoSpin Soil preparation kit from Clontech. In an alternative method, genomic DNA was prepared with the PowerMax Soil DNA Isolation Kit from Mo Bio Laboratories. Prior to DNA extraction, intact samples were preserved as glycerol stocks for future identification of the organism bearing genes of interest and for retrieval of complete gene sequences. Yields of DNA from soil samples ranged from 0.36 to 9.1 micrograms with A260/A280 ratios ranging from 1.50 to 1.89 (Table 2). Because soil DNA preparations have been reported to inhibit PCR reactions, which could hinder the gene enrichment protocol, DNA samples were used as template for PCR with primers designed against the microbial 16S rRNA. Samples 1-4 yielded a PCR product (Table 2), and those 4 samples were used for gene enrichment experiments. Additional DNA samples were prepared from pools of cultured environmental microbes containing up to 25,000 colonies.
To enrich these microbial pools for organisms likely to contain genes of interest, samples collected from about 920 diverse environmental sources were either (1) pasteurized to select for spore formers before plating on 0.1× LB medium, or (2) plated on media that select for certain bacteria. Selection for certain species could include growth of environmental samples on defined carbon sources (for example, starch, mannitol, succinate or acetate), antibiotics (for example, cephalothin, vancomycin, polymyxin, kanamycin, neomycin, doxycycline, ampicillin, trimethoprim or sulfonamides), or chromogenic substrates (for example, enzyme substrates such as phospholipase substrates, lecithinase substrates, cofactor metabolism substrates, nucleosidase substrates, glucosidase substrates, metalloprotease substrates and the like). Carbon or nitrogen sources could also be, for example, herbicides or other substrates that might select for organisms with herbicide detoxification or metabolism genes. Soil DNA preparations were spiked, at various ratios, with genomic DNA from 4 organisms known to contain genes of interest to serve as positive controls for the process (Table 2). In an alternative method, DNA from positive control strains was not included.
Table 3: Experimental design for gene enrichment experiments: DNA inputs for capture reactions including the environmental sample (described in Table 2), genes used as positive controls and the representation of genomic DNA from the positive control strains as a ratio to total DNA input.
Oligonucleotide baits: Baits for gene capture consisted of approximately 30,000 biotinylated 120 base RNA oligonucleotides that were designed against approximately 900 genes and represent 9 distinct gene families of agricultural interest (Table 4). The process is used iteratively such that each subsequent round of hybridization includes baits designed against genes discovered in a previous round of gene discovery. In addition to genes of interest, additional sequences were included as positive controls (housekeeping genes) and for microbe species identification (16S rRNA). Starting points for baits were staggered at 60 bases to confer 2× coverage for each gene. Baits were synthesized at Agilent with the SureSelect technology. However, additional products for similar use are available from Agilent and other vendors including NimbleGen (SeqCap EZ), Mycroarray (MYbaits), Integrated DNA Technologies (XGen), and LC Sciences (OligoMix).
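The staggered bait layout described above can be sketched as a simple tiling routine. This is an illustration only, not the design software used for the baits: the function names are hypothetical, the handling of genes shorter than one bait and of the final window is an assumption, and the sketch tiles the target sequence itself without modeling strand or DNA-to-RNA conversion.

```python
def tile_baits(gene_seq, bait_len=120, step=60):
    """Return (start, bait) tuples tiling gene_seq with overlapping windows.

    With bait_len=120 and step=60, every interior base falls in two baits,
    which is one way to read the 2x coverage described in the text.
    """
    if len(gene_seq) <= bait_len:
        # Assumption: a short gene yields a single bait covering all of it.
        return [(0, gene_seq)]
    last_start = len(gene_seq) - bait_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        # Assumption: add one final window flush with the end of the gene.
        starts.append(last_start)
    return [(s, gene_seq[s:s + bait_len]) for s in starts]


def coverage_depth(seq_len, baits):
    """Count how many baits overlap each base position."""
    depth = [0] * seq_len
    for start, bait in baits:
        for pos in range(start, start + len(bait)):
            depth[pos] += 1
    return depth


if __name__ == "__main__":
    gene = "ACGT" * 75  # a made-up 300 base sequence
    baits = tile_baits(gene)
    depth = coverage_depth(len(gene), baits)
    print([start for start, _ in baits])
    print(min(depth), max(depth))
```

Under this simple scheme the first and last 60 bases of a gene are covered by only one bait; extending baits into flanking sequence would be one way to bring the ends to 2×, though the source does not state how the ends were handled.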
Gene capture reactions: 3 μg of DNA was used as starting material for the procedure. DNA shearing, capture, post-capture washing and gene amplification were performed in accordance with Agilent SureSelect specifications. Throughout the procedure, DNA was purified with Agencourt AMPure XP beads, and DNA quality was evaluated with the Agilent TapeStation. Briefly, DNA was sheared to an approximate length of 800 bp using a Covaris Focused-ultrasonicator. In an alternative method, DNA is sheared to lengths from about 400 to about 2000 bp, including about 500 bp, about 600 bp, about 700 bp, about 900 bp, about 1000 bp, about 1200 bp, about 1400 bp, about 1600 bp, or about 1800 bp. The Agilent SureSelect Library Prep Kit was used to repair ends, add A bases, ligate the paired-end adaptor and amplify the adaptor-ligated fragments. Prepped DNA samples were lyophilized to contain 750 ng in 3.4 μL and mixed with Agilent SureSelect Hybridization buffers, Capture Library Mix and Block Mix. Hybridization was performed for at least 16 hours at 65° C. In an alternative method, hybridization is performed at a lower temperature (55° C.). DNAs hybridized to biotinylated baits were precipitated with Dynabeads MyOne Streptavidin T1 magnetic beads and washed with SureSelect Binding and Wash Buffers. Captured DNAs were PCR-amplified to add index tags and pooled for multiplexed sequencing.
Genomic DNA libraries were generated by adding a predetermined amount of sample DNA to, for example, the Paired End Sample prep kit PE-1021001 (ILLUMINA, Inc.) following manufacturer's protocol. Briefly, DNA fragments were generated by random shearing and conjugated to a pair of oligonucleotides in a forked adaptor configuration. The ligated products were amplified using two oligonucleotide primers, resulting in double-stranded blunt-ended products having a different adaptor sequence on either end. The libraries, once generated, were applied to a flow cell for cluster generation.
Clusters were formed prior to sequencing using the TruSeq PE v3 cluster kit (ILLUMINA, Inc.) following manufacturer's instructions. Briefly, products from a DNA library preparation were denatured and single strands annealed to complementary oligonucleotides on the flow cell surface. A new strand was copied from the original strand in an extension reaction and the original strand was removed by denaturation. The adaptor sequence of the copied strand was annealed to a surface-bound complementary oligonucleotide, forming a bridge and generating a new site for synthesis of a second strand. Multiple cycles of annealing, extension and denaturation in isothermal conditions resulted in growth of clusters, each approximately 1 μm in physical diameter.
The DNA in each cluster was linearized by cleavage within one adaptor sequence and denatured, generating single-stranded template for sequencing by synthesis (SBS) to obtain a sequence read. To perform paired-read sequencing, the products of read 1 were removed by denaturation, the template was used to generate a bridge, the second strand was re-synthesized and the opposite strand was cleaved to provide the template for the second read. Sequencing was performed using the ILLUMINA, Inc. V4 SBS kit with 100 base paired-end reads on the HiSeq 2000. Briefly, DNA templates were sequenced by repeated cycles of polymerase-directed single base extension. To ensure base-by-base nucleotide incorporation in a stepwise manner, a set of four reversible terminators, A, C, G, and T, each labeled with a different removable fluorophore, was used. The use of modified nucleotides allowed incorporation to be driven essentially to completion without risk of over-incorporation. It also enabled addition of all four nucleotides simultaneously, minimizing risk of misincorporation. After each cycle of incorporation, the identity of the inserted base was determined by laser-induced excitation of the fluorophores, and the fluorescence image was recorded. The fluorescent dye and linker were removed to regenerate an available group ready for the next cycle of nucleotide addition. The HiSeq sequencing instrument is designed to perform multiple cycles of sequencing chemistry and imaging to collect sequence data automatically from each cluster on the surface of each lane of an eight-lane flow cell.
Bioinformatics: Sequences were assembled using the CLC Bio suite of bioinformatics tools. The presence of genes of interest (Table 4) was determined by BLAST query against a database of those genes of interest. Diversity of organisms present in the sample can be evaluated from 16S identifications. Process QC was evaluated based on retrieval of the positive control sequences that were included in the reactions. To assess the capacity of this approach for new gene discovery, assembled genes, as well as individual sequencing reads, were BLASTed against baits and gene sequences published in public databases including NCBI and PatentLens. The lowest percent identity of a read to a bait was 66%, while the lowest percent identity to a gene was 77%. Example genes that were captured and sequenced with this method are shown in Table 6.
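The percent identity figures reported above come from BLAST alignments. As a minimal sketch of how such a figure is derived, the function below counts identical columns in two already-aligned sequences and divides by the alignment length, gaps included. The function name and the example sequences are hypothetical, and the sketch does not perform the alignment itself.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over a pairwise alignment given as two gapped strings.

    Identical, non-gap columns count as matches; the denominator is the
    alignment length, excluding any column that is a gap in both sequences.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == gap and b == gap)]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Made-up aligned fragments for illustration only.
    print(percent_identity("ACGTACGT", "ACGTTCGT"))
    print(percent_identity("ACG-ACGT", "ACGTACGT"))
```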
Novel Gene Confirmation: To confirm the physical presence of genes predicted from sample enrichment, capture, and sequencing of captured DNA, oligonucleotide primers were designed to amplify the corresponding sequences from the DNA samples and verify that the actual sequence matched the predicted sequence. Genes were called “novel homologs” if they contained domain characteristics of a targeted known gene but had less than 95% identity to a known gene.
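The “novel homolog” criterion above can be expressed as a short filter. The sketch assumes each assembled gene has already been annotated with its highest percent identity to any known gene and with a flag indicating whether the domains of a targeted gene were found; the record type, field names, and gene identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AssembledGene:
    gene_id: str
    best_identity: float      # highest percent identity to any known gene
    has_target_domain: bool   # domain characteristics of a targeted gene found


def call_novel_homologs(genes, identity_cutoff=95.0):
    """Return IDs of genes with target domains and identity below the cutoff."""
    return [g.gene_id for g in genes
            if g.has_target_domain and g.best_identity < identity_cutoff]


if __name__ == "__main__":
    candidates = [
        AssembledGene("contig_1", 99.2, True),   # near-identical to a known gene
        AssembledGene("contig_2", 81.0, True),   # meets both criteria
        AssembledGene("contig_3", 70.5, False),  # lacks the targeted domains
    ]
    print(call_novel_homologs(candidates))
```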
Number | Date | Country | |
---|---|---|---|
61925422 | Jan 2014 | US |