The present invention relates to the fields of molecular biology and genetics. The invention relates to improved strategies for determining the sequence of genomes, preferably complex (i.e. large) genomes, based on the use of high throughput sequencing technologies.
Assembly of whole genome shotgun sequences of large genomes (from 100 Mbp upwards) into draft genome sequences is a complex issue. Many plants and animals contain a large number of repeat sequences, which complicates the problem further. The computational problem is enlarged by the emergence of high throughput sequencing technologies, such as those of 454 Life Sciences. These technologies are often no longer based on Sanger dideoxysequencing, but predominantly on sequencing by synthesis (pyrosequencing), which is easier to perform on a solid surface. Sequencing by synthesis provides a large number of sequences, albeit of a relatively short length (about 100 bp) compared to the 500 to 1000 bp that is common for Sanger dideoxysequencing.
One of the disadvantages of such short fragments is that the assembly of contigs to determine the genome sequence requires enormous computational power, making the current methods of sequencing a relatively expensive and time consuming quest. Consequently there is a need for cheap, reliable and fast methods of sequencing complex, i.e. large, genomes to further the technology towards what is sometimes called the “$1000 genome”, i.e. a method that allows the determination of the entire sequence of a complex genome (human in particular) for not more than $1000. This would allow, inter alia, for the development of personalized medication.
The present inventors have now found that with a different strategy this problem can be solved and the high throughput sequencing technologies can be efficiently used in genome assembly.
The invention comprises employing a technology that divides the genome into reproducible and complementary parts by restricting the genome with one or more restriction endonucleases to yield a set of restriction fragments and subsequently providing a subset of restriction fragments by selective amplification. The subset is sequenced and assembled into contigs. By repeating this step for one or more different sets of restriction endonucleases, different contigs are obtained. These different contigs are used to assemble the draft genome sequence. The invention does not require any prior knowledge of the sequence and can be scaled to genomes of any type, size and complexity. The present invention provides quicker and more reliable access to any genome of interest and thereby provides for accelerated analysis of the genome.
In the following description and examples a number of terms are used. In order to provide a clear and consistent understanding of the specification and claims, including the scope to be given such terms, the following definitions are provided. Unless otherwise defined herein, all technical and scientific terms used have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The disclosures of all publications, patent applications, patents and other references are incorporated herein in their entirety by reference.
Nucleic acid: a nucleic acid according to the present invention may include any polymer or oligomer of pyrimidine and purine bases, preferably cytosine, thymine, and uracil, and adenine and guanine, respectively (See Albert L. Lehninger, Principles of Biochemistry, at 793-800 (Worth Pub. 1982) which is herein incorporated by reference in its entirety for all purposes). The present invention contemplates any deoxyribonucleotide, ribonucleotide or peptide nucleic acid component, and any chemical variants thereof, such as methylated, hydroxymethylated or glycosylated forms of these bases, and the like. The polymers or oligomers may be heterogenous or homogenous in composition, and may be isolated from naturally occurring sources or may be artificially or synthetically produced. In addition, the nucleic acids may be DNA or RNA, or a mixture thereof, and may exist permanently or transitionally in single-stranded or double-stranded form, including homoduplex, heteroduplex, and hybrid states.
Complexity reduction: the term complexity reduction is used to denote a method wherein the complexity of a nucleic acid sample, such as genomic DNA, is reduced by the generation of a subset of the sample. This subset can be representative for the whole (i.e. complex) sample and is preferably a reproducible subset. Reproducible means in this context that when the same sample is reduced in complexity using the same method, the same, or at least comparable, subset is obtained. The method used for complexity reduction may be any method for complexity reduction known in the art. Examples of methods for complexity reduction include for example AFLP® (Keygene N.V., the Netherlands; see e.g. EP 0 534 858), the methods described by Dong (see e.g. WO 03/012118, WO 00/24939), indexed linking (Unrau et al., vide infra), etc. The complexity reduction methods used in the present invention have in common that they are reproducible. Reproducible in the sense that when the same sample is reduced in complexity in the same manner, the same subset of the sample is obtained, as opposed to more random complexity reduction such as microdissection or the use of mRNA (cDNA) which represents a portion of the genome transcribed in a selected tissue and for its reproducibility is depending on the selection of tissue, time of isolation etc.
Tagging: the term tagging refers to the addition of a tag to a nucleic acid sample in order to be able to distinguish it from a second or further nucleic acid sample. Tagging can e.g. be performed by the addition of a sequence identifier during complexity reduction or by any other means known in the art. Such sequence identifier can e.g. be a unique base sequence of varying but defined length uniquely used for identifying a specific nucleic acid sample. Typical examples thereof are for instance ZIP sequences. Using such tag, the origin of a sample can be determined upon further processing. In case of combining processed products originating from different nucleic acid samples, the different nucleic acid samples should be identified using different tags.
Tagged library: the term tagged library refers to a library of tagged nucleic acid.
Sequencing: The term sequencing refers to determining the order of nucleotides (base sequences) in a nucleic acid sample, e.g. DNA or RNA.
Aligning and alignment: With the terms “aligning” and “alignment” is meant the comparison of two or more nucleotide sequences based on the presence of short or long stretches of identical or similar nucleotides. Several methods for alignment of nucleotide sequences are known in the art, as will be further explained below. Sometimes assembling is used as a synonym.
High-throughput screening: High-throughput screening, often abbreviated as HTS, is a method for scientific experimentation especially relevant to the fields of biology and chemistry. Through a combination of modern robotics and other specialised laboratory hardware, it allows a researcher to effectively screen large amounts of samples simultaneously.
Restriction endonuclease: a restriction endonuclease or restriction enzyme is an enzyme that recognizes a specific nucleotide sequence (target site) in a double-stranded DNA molecule, and will cleave both strands of the DNA molecule at every target site.
Restriction fragments: the DNA molecules produced by digestion with a restriction endonuclease are referred to as restriction fragments. Any given genome (or nucleic acid, regardless of its origin) will be digested by a particular restriction endonuclease into a discrete set of restriction fragments. The DNA fragments that result from restriction endonuclease cleavage can be further used in a variety of techniques and can for instance be detected by gel electrophoresis.
Gel electrophoresis: in order to detect restriction fragments, an analytical method for fractionating double-stranded DNA molecules on the basis of size can be required. The most commonly used technique for achieving such fractionation is (capillary) gel electrophoresis. The rate at which DNA fragments move in such gels depends on their molecular weight; thus, the distances traveled decrease as the fragment lengths increase. The DNA fragments fractionated by gel electrophoresis can be visualized directly by a staining procedure e.g. silver staining or staining using ethidium bromide, if the number of fragments included in the pattern is sufficiently small. Alternatively further treatment of the DNA fragments may incorporate detectable labels in the fragments, such as fluorophores or radioactive labels.
Ligation: the enzymatic reaction catalyzed by a ligase enzyme in which two double-stranded DNA molecules are covalently joined together is referred to as ligation. In general, both DNA strands are covalently joined together, but it is also possible to prevent the ligation of one of the two strands through chemical or enzymatic modification of one of the ends of the strands. In that case the covalent joining will occur in only one of the two DNA strands.
Synthetic oligonucleotide: single-stranded DNA molecules having preferably from about 10 to about 50 bases, which can be synthesized chemically are referred to as synthetic oligonucleotides. In general, these synthetic DNA molecules are designed to have a unique or desired nucleotide sequence, although it is possible to synthesize families of molecules having related sequences and which have different nucleotide compositions at specific positions within the nucleotide sequence. The term synthetic oligonucleotide will be used to refer to DNA molecules having a designed or desired nucleotide sequence.
Adaptors: short double-stranded DNA molecules with a limited number of base pairs, e.g. about 10 to about 30 base pairs in length, which are designed such that they can be ligated to the ends of restriction fragments. Adaptors are generally composed of two synthetic oligonucleotides which have nucleotide sequences which are partially complementary to each other. When mixing the two synthetic oligonucleotides in solution under appropriate conditions, they will anneal to each other forming a double-stranded structure. After annealing, one end of the adaptor molecule is designed such that it is compatible with the end of a restriction fragment and can be ligated thereto; the other end of the adaptor can be designed so that it cannot be ligated, but this need not be the case (double ligated adaptors).
Adaptor-ligated restriction fragments: restriction fragments that have been capped by adaptors as a result of ligation.
Primers: in general, the term primers refers to a DNA strand which can prime the synthesis of DNA. DNA polymerase cannot synthesize DNA de novo without primers: it can only extend an existing DNA strand in a reaction in which the complementary strand is used as a template to direct the order of nucleotides to be assembled. We will refer to the synthetic oligonucleotide molecules which are used in a polymerase chain reaction (PCR) as primers.
DNA amplification: the term DNA amplification will be typically used to denote the in vitro synthesis of double-stranded DNA molecules using PCR. It is noted that other amplification methods exist and they may be used in the present invention without departing from the gist thereof.
The present invention provides for a method for determining a genome sequence comprising the steps of:
In step (a) of the method, the genome of interest is subjected to one or more restriction endonucleases. In certain embodiments, at least two restriction endonucleases are used. In certain embodiments, in particular with large genomes, three or more restriction endonucleases can be used. Digestion of the genome provides a first subset of the genome. The restriction endonucleases can be frequent cutters (typically 4 and 5 cutters, i.e. restriction endonucleases that have a recognition sequence of 4 or 5 nucleotides, respectively), rare cutters (typically 6 and higher cutters (7, 8, etc.), i.e. restriction endonucleases that have a recognition sequence of 6 or more nucleotides), or combinations thereof. In certain embodiments a combination of a rare and a frequent cutter is used. In certain embodiments two rare cutters may be used. The restriction endonucleases can be of any type, including IIs and IIsa types that cut the DNA outside their recognition sequence, either on one or on both sides of the recognition sequence.
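By way of illustration only, the digestion of step (a) can be mimicked in silico. The following minimal sketch assumes a single-stranded (top strand) representation of the genome; the function name `digest`, the toy sequence and the dictionary layout are hypothetical and serve only as an example. The recognition sites and top-strand cut positions used are those of EcoRI (G^AATTC) and MseI (T^TAA).

```python
def digest(sequence, enzymes):
    """Cut `sequence` at every recognition site of every enzyme.

    `enzymes` maps an enzyme name to (recognition site, cut offset), where
    the offset is the top-strand cut position counted from the start of
    the site. Only the top strand is modelled (illustrative assumption).
    """
    cut_positions = set()
    for site, offset in enzymes.values():
        start = sequence.find(site)
        while start != -1:
            cut_positions.add(start + offset)
            start = sequence.find(site, start + 1)
    boundaries = [0] + sorted(cut_positions) + [len(sequence)]
    # Skip empty pieces that would arise from a cut at either end.
    return [sequence[a:b] for a, b in zip(boundaries, boundaries[1:]) if b > a]


# EcoRI cuts G^AATTC, MseI cuts T^TAA (offset 1 in both cases).
ENZYMES = {"EcoRI": ("GAATTC", 1), "MseI": ("TTAA", 1)}

fragments = digest("CCGAATTCGGTTAACC", ENZYMES)
print(fragments)  # ['CCG', 'AATTCGGT', 'TAACC']
```

The middle fragment of this toy example carries an EcoRI end and an MseI end and is therefore of the type that would be adaptor-ligated at both ends in step (b).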
In step (b) of the method, at least one adaptor is ligated to the restriction fragments obtained in step (a). Preferably, the adaptors are such that the restriction site is not restored upon ligation of the adaptor. It is also possible, for instance in case of two or more restriction endonucleases to employ two or more different adaptors. This ligation step yields adaptor-ligated restriction fragments. The adaptors, depending on the restriction endonuclease, can be blunt ended or may contain an overhang.
In certain embodiments, the adaptor may be a set of adaptors known as indexing linkers (Unrau et al., 1994, Gene, 145:163-169).
In step (c), the first set of adaptor-ligated restriction fragments is amplified using a first primer combination. The primer combination comprises at least a first primer that contains a section that is complementary to (at least part of) the adaptor and to part of the recognition sequence of the restriction endonuclease used in the restriction of the genome. Typically, the part of the recognition sequence is that part that remains after restriction of the sequence with the restriction endonuclease. At its 3′-end the primer contains a first selected sequence. The first selected sequence comprises a previously selected set of 1-10 nucleotides, preferably 1-8 selected nucleotides, preferably 1-5, more preferably 1-3. Such a primer may have the following, illustrative, structure (for 2 selective nucleotides (AC)) “5′-adaptor specific region-restriction sequence specific region-AC-3′”. This exemplary first primer thus contains 2 selective nucleotides AC which will only amplify adaptor-ligated fragments that contain the complementary TG as the first two nucleotides derived from the sequence of the restriction fragment. This provides the first subset of amplified adaptor-ligated restriction fragments.
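The selection performed by the selective nucleotides can be illustrated with a minimal sketch. It is assumed here, for illustration only, that each restriction fragment is represented by the internal sequence that directly follows the remains of the recognition site, written in the same sense as the primer, so that a primer ending in AC matches fragments whose internal sequence starts with AC (i.e. whose template strand carries the complementary TG). The function name and the toy sequences are hypothetical.

```python
def selectively_amplified(inserts, selective):
    """Return the fragments that a primer with the given 3' selective
    nucleotides would amplify.

    `inserts` holds, per fragment, the internal sequence directly adjacent
    to the remains of the recognition site, in primer sense (assumption).
    """
    return [seq for seq in inserts if seq.startswith(selective)]


inserts = ["ACGGT", "AGGTC", "ACTTA", "TTACG"]
print(selectively_amplified(inserts, "AC"))  # ['ACGGT', 'ACTTA']
```

Mismatch tolerance of the polymerase and selection at the second fragment end are deliberately ignored in this sketch.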
The first primer combination may also comprise two selective primers, each carrying a selected sequence at their 3′-end. The primers can be tagged to allow for pooling strategies.
The amplification is preferably carried out using PCR. In certain embodiments the use of Long-Range PCR is preferred.
In step (d), the selective amplification is repeated with second and further primer combinations. At least one of the primers in each of the further primer combinations contains a different selected sequence at its 3′-end. The selection of the selected sequences is such that, given the number of selected nucleotides, all possible permutations of the selective nucleotides are used. In the above example this means AT, AG, AA, CA, CT, CG, CC etc. In practice this means that all adaptor-ligated restriction fragments within the subset of the genome (i.e. within the set of restriction fragments obtained using the one or more restriction endonucleases) have been amplified.
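The set of selected sequences that step (d) runs through can be enumerated mechanically. The following sketch is illustrative only; the function name `selective_sequences` is hypothetical.

```python
from itertools import product


def selective_sequences(n, alphabet="ACGT"):
    """All possible 3' selective sequences of n nucleotides."""
    return ["".join(p) for p in product(alphabet, repeat=n)]


two_nt = selective_sequences(2)
print(len(two_nt))   # 16
print(two_nt[:5])    # ['AA', 'AC', 'AG', 'AT', 'CA']
```

For n selective nucleotides on one primer this yields 4 to the power n sequences, i.e. 16 for the two selective nucleotides of the example.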
In a preferred embodiment of the invention, the reduction of the complexity of the genome by selective amplification is performed by means of AFLP® (Keygene N.V., the Netherlands; see e.g. EP 0 534 858 and Vos et al. (1995). AFLP: a new technique for DNA fingerprinting, Nucleic Acids Research, vol. 23, no. 21, 4407-4414, which are herein incorporated in their entirety by reference).
AFLP is a method for selective restriction fragment amplification. AFLP does not require any prior sequence information and can be performed on any starting DNA. In general, AFLP comprises the steps of:
AFLP thus provides a reproducible subset of adaptor-ligated fragments. One useful variant of the AFLP technology uses no selective nucleotides (i.c. +0/+0 primers) and is sometimes called linker PCR. This also provides for a very suitable complexity reduction, in particular for smaller genomes.
For a further description of AFLP, its advantages, its embodiments, as well as the techniques, enzymes, adaptors, primers and further compounds and tools used therein, reference is made to U.S. Pat. No. 6,045,994, EP-B-0 534 858, EP 976835 and EP 974672, WO01/88189 and Vos et al. Nucleic Acids Research, 1995, 23, 4407-4414, which are hereby incorporated in their entirety.
Thus, in a preferred embodiment of the method of the present invention, the genome is reduced in complexity by
AFLP is a highly reproducible method for complexity reduction and is therefore particularly suited for the method according to the present invention.
Hitherto in the art of sequencing technology, the use of this selective amplification in the sequence determination of whole genomes, and in particular in complex genomes has not been disclosed or suggested. The AFLP-technology is known in the art as a fingerprinting technology and has not yet been identified as a solution to aid in the sequencing of complex genomes. In particular, the use of a set of primer combinations that cover all or most of the permutations of nucleotides for a given number of selective nucleotides (for instance 16 primer combinations in the case of two selective nucleotides) provides for a reliable and quick method to provide for complementary and reproducible subsets of a genome that can be sequenced. In certain embodiments, the primers used in the complexity reduction contain one or more thioate linkages to increase their selectivity and/or performance.
In certain alternative embodiments, complexity reduction comprises Chromatin Immuno Precipitation (ChIP). In this method, nuclear DNA is isolated whilst proteins such as transcription factors remain linked to the DNA. With ChIP, first an antibody is used against the protein, resulting in an Ab-protein-DNA complex. By purifying this complex and precipitating it, the DNA to which this protein binds is selected. Subsequently, the DNA can be used for library construction and sequencing. That is, this is a method to perform a complexity reduction in a non-random fashion directed to specific functional areas; in the present example, the binding regions of specific transcription factors. Alternative embodiments may use the design of PCR primers directed against conserved motifs such as SSRs, NBS regions (nucleotide binding regions), promoter/enhancer sequences, telomere consensus sequences, MADS box genes, ATP-ase gene families and other gene families.
In step (e), first, second and further sequencing libraries are generated for each subset of amplified adaptor-ligated restriction fragments. The libraries are typically generated by fragmentation of the amplified adaptor-ligated restriction fragments. Fragmentation can be achieved by physical techniques, i.e. shearing, sonication or other random fragmentation methods. In step (f), at least part, but preferably the entire, nucleotide sequence of at least part of, but preferably of all, the fragments contained in the libraries is determined.
The sequencing may in principle be conducted by any means known in the art, such as the dideoxy chain termination method. It is however preferred that the sequencing is performed using high-throughput sequencing methods, such as the methods disclosed in WO 03/004690, WO 03/054142, WO 2004/069849, WO 2004/070005, WO 2004/070007, and WO 2005/003375 (all in the name of 454 Life Sciences), by Seo et al. (2004) Proc. Natl. Acad. Sci. USA 101:5488-93, and technologies of Helios, Solexa, US Genomics, etcetera, which are herein incorporated by reference. It is most preferred that sequencing is performed using the apparatus and/or method disclosed in WO 03/004690, WO 03/054142, WO 2004/069849, WO 2004/070005, WO 2004/070007, and WO 2005/003375 (all in the name of 454 Life Sciences), which are herein incorporated by reference. The technology described allows sequencing of 40 million bases in a single run and is 100 times faster and cheaper than competing technology. The sequencing technology roughly consists of 5 steps: 1) fragmentation of DNA and ligation of specific adaptors to create a library of single-stranded DNA (ssDNA); 2) annealing of ssDNA to beads, emulsification of the beads in water-in-oil microreactors and performing emulsion PCR to amplify the individual ssDNA molecules on beads; 3) selection of/enrichment for beads containing amplified ssDNA molecules on their surface; 4) deposition of DNA carrying beads in a PicoTiterPlate®; and 5) simultaneous sequencing in 100,000 wells by generation of a pyrophosphate light signal. The method will be explained in more detail below.
In a preferred embodiment, the sequencing comprises the steps of:
In the first step (a), sequencing adaptors are ligated to fragments within the combination library. Said sequencing adaptor includes at least a “key” region for annealing to a bead, a sequencing primer region and a PCR primer region. Thus, adapted fragments are obtained.
In a next step, adapted fragments are annealed to beads, each bead annealing with a single adapted fragment. To the pool of adapted fragments, beads are added in excess so as to ensure annealing of one single adapted fragment per bead for the majority of the beads (Poisson distribution).
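The effect of adding beads in excess can be made concrete with the Poisson distribution. The sketch below is illustrative only: the mean number of fragments per bead (`lam`, here 0.1) is an assumed value not taken from the present description, and the function names are hypothetical.

```python
import math


def poisson(k, lam):
    """Probability of exactly k fragments on a bead for a mean of lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def bead_occupancy(lam):
    """Fractions of empty, single and multiple beads (lam must be > 0)."""
    empty = poisson(0, lam)
    single = poisson(1, lam)
    multiple = 1.0 - empty - single
    # Share of DNA-carrying beads that carry exactly one fragment.
    clonal = single / (1.0 - empty)
    return {"empty": empty, "single": single,
            "multiple": multiple, "clonal": clonal}


occupancy = bead_occupancy(0.1)  # assumed: one fragment per ten beads
for name, value in occupancy.items():
    print(f"{name}: {value:.4f}")
```

Under this assumption about 90% of the beads stay empty, but about 95% of the DNA-carrying beads hold a single adapted fragment, which is why an excess of beads is used and DNA positive beads are enriched afterwards.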
In a next step, the beads are emulsified in water-in-oil microreactors, each water-in-oil microreactor comprising a single bead. PCR reagents are present in the water-in-oil microreactors allowing a PCR reaction to take place within the microreactors. Subsequently, the microreactors are broken, and the beads comprising DNA (DNA positive beads) are enriched.
In a following step, the beads are loaded in wells, each well comprising a single bead. The wells are preferably part of a PicoTiter™Plate allowing for simultaneous sequencing of a large amount of fragments.
After addition of enzyme-carrying beads, the sequence of the fragments is determined using pyrosequencing. In successive steps, the PicoTiter™Plate and the beads as well as the enzyme beads therein are subjected to different deoxyribonucleotides in the presence of conventional sequencing reagents, and upon incorporation of a deoxyribonucleotide a light signal is generated which is recorded. Incorporation of the correct nucleotide will generate a pyrosequencing signal which can be detected.
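The principle of reading a sequence from successive nucleotide flows can be sketched as follows. This is an idealised, noise-free model given for illustration only: the flow order TACG, the assumption that the light signal is exactly proportional to the number of nucleotides incorporated in a flow, and the function names are assumptions of the sketch and not features of the described apparatus.

```python
from itertools import cycle


def flowgram(read, flow_order="TACG"):
    """Ideal signal per nucleotide flow for the strand being synthesised.

    Each flow incorporates the whole run of identical nucleotides at the
    current position; the signal equals the run length (0 = no light).
    """
    if set(read) - set(flow_order):
        raise ValueError("read contains a nucleotide that is never flowed")
    signals, pos = [], 0
    for base in cycle(flow_order):
        if pos >= len(read):
            break
        run = 0
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append(run)
    return signals


def base_call(signals, flow_order="TACG"):
    """Translate an ideal flowgram back into a nucleotide sequence."""
    return "".join(base * n for base, n in zip(cycle(flow_order), signals))


print(flowgram("TTACG"))               # [2, 1, 1, 1]
print(flowgram("GAT"))                 # [0, 0, 0, 1, 0, 1, 0, 0, 1]
print(base_call(flowgram("GATTACA")))  # GATTACA
```

In real data the signal for longer homopolymer runs is not perfectly proportional, which this sketch does not attempt to model.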
Pyrosequencing itself is known in the art and described inter alia on www.biotagebio.com; www.pyrosequencing.com/section technology. The technology is further applied in e.g. WO 03/004690, WO 03/054142, WO 2004/069849, WO 2004/070005, WO 2004/070007, and WO 2005/003375 (all in the name of 454 Life Sciences), which are herein incorporated by reference.
In step (g) of the method of the invention, the determined sequences of the fragments of the first, second and/or further libraries are aligned. The alignment provides contigs of the fragments in the subsets of the amplified adaptor-ligated restriction fragments. In this way, for each amplified adaptor-ligated restriction fragment, a contig is generated from sequenced fragments, i.e. the contig of one amplified adaptor-ligated restriction fragment is built up from the alignment of the sequences of the various fragments obtained from the fragmenting in step (e). By building contigs from sequences representing dispersed restriction fragments of a small portion of the genome, problems with contig building resulting from abundant repeat sequences are greatly diminished, leading to a higher quality draft genome sequence which contains fewer errors due to false-joining of repeated sequences. In addition, the assembly process will be computationally less complex and therefore faster to perform. By aligning the sequences in the different libraries, contigs for each restriction fragment of the set of restriction fragments can be built for each primer combination. This results in a set of contigs, each corresponding to a particular restriction fragment. As a result, each restriction fragment obtained from the restriction of the genome with the at least one restriction endonuclease now has a determined (contig) sequence. The method of the invention is illustrated in
Methods of alignment of sequences for comparison purposes are well known in the art. Various programs and alignment algorithms are described in: Smith and Waterman (1981) Adv. Appl. Math. 2:482; Needleman and Wunsch (1970) J. Mol. Biol. 48:443; Pearson and Lipman (1988) Proc. Natl. Acad. Sci. USA 85:2444; Higgins and Sharp (1988) Gene 73:237-244; Higgins and Sharp (1989) CABIOS 5:151-153; Corpet et al. (1988) Nucl. Acids Res. 16:10881-90; Huang et al. (1992) Computer Appl. in the Biosci. 8:155-65; and Pearson et al. (1994) Meth. Mol. Biol. 24:307-31, which are herein incorporated by reference. Altschul et al. (1994) Nature Genet. 6:119-29 (herein incorporated by reference) present a detailed consideration of sequence alignment methods and homology calculations.
The NCBI Basic Local Alignment Search Tool (BLAST) (Altschul et al., 1990) is available from several sources, including the National Center for Biotechnology Information (NCBI, Bethesda, Md.) and on the Internet, for use in connection with the sequence analysis programs blastp, blastn, blastx, tblastn and tblastx. It can be accessed at <http://www.ncbi.nlm.nih.gov/BLAST/>. A description of how to determine sequence identity using this program is available at <http://www.ncbi.nlm.nih.gov/BLAST/blast_help.html>. A further application can be in microsatellite mining (see Varshney et al. (2005) Trends in Biotechn. 23(1):48-55).
Typically, the alignment is performed on sequence data that have been trimmed for the adaptors/primer and/or identifiers but with reconstructed restriction enzyme recognition sequences, i.e. using only the sequence data from the fragments that originate from the nucleic acid sample. Typically, the sequence data obtained are used for identifying the origin of the fragment (i.e. from which sample), the sequences derived from the adaptor and/or identifier are removed from the data and alignment is performed on this trimmed set.
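The identification and trimming described above can be sketched as follows. The sketch assumes, for illustration only, that each read starts with a sample identifier directly followed by the adaptor-derived primer sequence and then the remains of the recognition site; the identifier sequences, the adaptor-derived sequence and the function name are hypothetical placeholders.

```python
def assign_and_trim(read, tags, primer_core):
    """Return (sample, trimmed read); sample is None if no tag matches.

    The remains of the recognition site are kept on the trimmed read, so
    that the recognition sequence can still be reconstructed.
    """
    for sample, tag in tags.items():
        prefix = tag + primer_core
        if read.startswith(prefix):
            return sample, read[len(prefix):]
    return None, read


TAGS = {"sample1": "ACGT", "sample2": "TGCA"}  # hypothetical identifiers
PRIMER_CORE = "GATCGATC"                       # hypothetical adaptor-derived part

result = assign_and_trim("TGCAGATCGATCAATTCGGTAC", TAGS, PRIMER_CORE)
print(result)  # ('sample2', 'AATTCGGTAC')
```

Sequencing errors within the identifier and identifiers that are prefixes of one another are not handled in this sketch.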
In step (h), the whole procedure is repeated at least once with one or more different restriction endonucleases, i.e. restriction endonucleases that preferably have a different recognition site than the first endonuclease. This provides a second or even further set of restriction fragments that are subsequently adaptor-ligated and selectively amplified using primer combinations with a selected sequence that is independently selected, i.e. that bears no relationship (either in number or in type of nucleotides) to the selected sequences that have been used for the same purpose with the first restriction endonuclease. For example, the first subset can be obtained by restriction with MseI/PstI and selectively amplified using a selective primer for the MseI-remains of the recognition site that carries 2 selective nucleotides at its 3′ end. The second subset can be obtained by EcoRI/HindIII digestion and selective amplification with a selective primer for the EcoRI-remains of the recognition site with 1 selective nucleotide.
Thus a second (and/or further) set of contigs for all restriction fragments can be obtained in this way, in a similar manner as disclosed herein before. This is necessary because, for a given restriction endonuclease, the fractions of the genome that are being sequenced are complementary, i.e. they do not overlap. The contigs obtained with different enzyme combinations do overlap and hence allow the generation of a contig therefrom and thus the formation of a (draft) genome sequence.
In step (i) of the method, the contigs obtained from the previous steps of the method for each fragment are aligned to form a sequence of the genome.
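How overlapping contigs from different enzyme combinations can be joined may be illustrated with a simple greedy overlap merge. This is a sketch under strong simplifying assumptions (error-free sequences, all in the same orientation, exact suffix-prefix overlaps); it is not the assembly algorithm of any particular program, and the function names, the minimum overlap and the toy sequences are hypothetical.

```python
def overlap(a, b, min_overlap):
    """Length of the longest suffix of a equal to a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(sequences, min_overlap=4):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(sequences))  # drop exact duplicates
    # Drop sequences that are fully contained in another sequence.
    contigs = [s for s in contigs
               if not any(s != o and s in o for o in contigs)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Contigs of one enzyme combination are complementary: they do not overlap.
contigs_ec1 = ["ATGGCGTA", "ACGTTAGC"]
# A contig from a second enzyme combination bridges them.
contigs_ec2 = ["GCGTACGT"]

print(greedy_assemble(contigs_ec1))                # ['ATGGCGTA', 'ACGTTAGC']
print(greedy_assemble(contigs_ec1 + contigs_ec2))  # ['ATGGCGTACGTTAGC']
```

The toy example reflects the reasoning of steps (h) and (i): the contigs of a single enzyme combination stay separate, whereas adding a contig from a second enzyme combination allows them to be joined.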
In certain embodiments, the contig building of either the restriction fragments or of the genome sequence can be aided by the use of nucleotide sequences of the genome that are derived from other sources, including, but not limited to BAC-end sequences, BAC shotgun sequences, EST sequences or whole genome shotgun sequences.
The method of the present invention is independent of the source of DNA, i.e. applicable to all organisms, as it does not require any previous sequence information. By appropriate selection of enzymes, adaptors, primers and number of selective nucleotides, a scalable technology is presented that is applicable to genomes of all sizes and complexities. Furthermore, the genome fractions that are obtained with the different primer combinations and/or with the selective primers that differ from each other in the specific selective nucleotide sequence at the 3′-end are complementary. This means that when, for any given number of selective nucleotides, all permutations are being used (1 selective nucleotide equals 4 variants (A, C, T, G), 2 selective nucleotides 16, 3 selective nucleotides 64 variants and so on), the combined restriction fragments constitute the restricted genome.
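The complementarity of the genome fractions can be checked with a small sketch in which every fragment is assigned to exactly one +X/+Y primer combination. For illustration it is assumed that each fragment is represented by its internal sequence between the remains of the two recognition sites, that the first primer reads into the fragment from one end and the second primer from the other end (reverse complement), and that fragments are at least as long as X and Y and contain only A, C, G and T. All names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def primer_combination(insert, x, y):
    """Selective nucleotides (+X, +Y) that would amplify this fragment."""
    return insert[:x], reverse_complement(insert)[:y]


def partition(inserts, x, y):
    """Distribute fragments over all 4**(x + y) primer combinations."""
    bins = {(a, b): []
            for a in map("".join, product("ACGT", repeat=x))
            for b in map("".join, product("ACGT", repeat=y))}
    for insert in inserts:
        bins[primer_combination(insert, x, y)].append(insert)
    return bins


inserts = ["ACGGT", "TTGCA", "GGGCC", "ACGGA"]
bins = partition(inserts, x=1, y=2)
print(len(bins))                            # 64
print(sum(len(v) for v in bins.values()))   # 4
print(primer_combination("ACGGT", 1, 2))    # ('A', 'AC')
```

Each fragment ends up in one and only one of the 64 bins, so the bins are non-overlapping and together contain all fragments.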
The invention can be illustrated by means of the following examples that are not intended to limit the invention in any way, but merely serve as an illustration.
Whole Genome Sequencing using Long-Range PCR.
Step 1: DNA is restricted using two 6 cutters A and B (for instance EcoRI and HindIII). This generates three types of restriction fragments: A-A (25%), B-B (25%) and A-B (50%) with an average length of about 3-4 kb, depending on the GC-content of the genome of interest and the selection of the restriction endonucleases. After ligation of adaptors, long-range PCR is performed with +X/+Y primers (i.e. one of the primers contains X selective nucleotides and the other Y), to 1 Mb sequence per primer combination. In the case that X=2 and Y=3, repeat this for all 1024 primer combinations. In the case that X=1 and Y=2, repeat this for all 64 primer combinations.
Step 2: construction of libraries by shearing of each set of amplified adaptor-ligated restriction fragments and sequencing using emulsion PCR in combination with pyrosequencing of 454 Life Sciences as described herein elsewhere, for each primer combination (1024 or 64 times). The sequencing using this technology provides 40 Mbp of sequence data per library, meaning that every library is 40-fold redundantly sequenced. By variation of the number of selective nucleotides, a different number of A-B fragments is amplified and a different redundancy will be achieved. This can be determined in practice.
Step 3: assembly of the sequences per sequence library (per primer combination, PC).
Assembly is performed to generate contigs of all A-B fragments that have been amplified per PC. This leads to about 300 to 500 contigs per PC, of which the average length will correspond to the average length of the A-B fragment. The sequencing of all PCs results in a number of contigs that varies from several tens of thousands up to several hundred thousand (Arabidopsis 21000, Maize 450000).
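The figures of Step 1 follow from simple counting, which can be sketched as below. The sketch assumes, for illustration only, that each fragment end is independently an A end with probability `p_a` (0.5 when both enzymes cut equally often); the function names are hypothetical.

```python
def n_primer_combinations(x, y):
    """Number of +X/+Y primer combinations for four nucleotides."""
    return 4 ** (x + y)


def fragment_type_fractions(p_a):
    """Expected fractions of A-A, B-B and A-B fragments."""
    p_b = 1.0 - p_a
    return {"A-A": p_a ** 2, "B-B": p_b ** 2, "A-B": 2 * p_a * p_b}


print(n_primer_combinations(2, 3))   # 1024
print(n_primer_combinations(1, 2))   # 64
print(fragment_type_fractions(0.5))  # {'A-A': 0.25, 'B-B': 0.25, 'A-B': 0.5}
```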
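The redundancy stated in Step 2 is the ratio of the sequence data obtained to the sequence complexity of the library. The sketch below computes this ratio and, as an added illustration not taken from the present description, the expected fraction of positions left unsequenced under the common simplifying assumption that reads start at uniformly random positions (a Poisson model, in which this fraction equals exp(-redundancy)). Function names are hypothetical.

```python
import math


def fold_redundancy(sequenced_bp, library_bp):
    """Average number of times each position of the library is sequenced."""
    return sequenced_bp / library_bp


def expected_uncovered_fraction(redundancy):
    """Fraction of positions with zero coverage under a Poisson model."""
    return math.exp(-redundancy)


redundancy = fold_redundancy(40e6, 1e6)  # 40 Mbp of data on a 1 Mb library
print(redundancy)                        # 40.0
print(expected_uncovered_fraction(redundancy) < 1e-17)  # True
```

With fewer selective nucleotides the library complexity per primer combination increases and the redundancy obtained from the same amount of sequence data decreases accordingly.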
Step 4: repeat steps 1-3 for at least one other enzyme combination (EC), for instance A-C. This is obligatory because all PCs from EC A-B only provide complementary contigs and not overlapping contigs, which cover only 50% of the genome. The coverage of the genome can be enhanced by also processing all A-A and B-B fragments. By using additional ECs, overlap is achieved between the contigs of AB and of AC and the genome coverage increases.
Step 5: assemble all contigs of A-B (optionally also A-A, B-B) and A-C primer combinations to a (draft) genome sequence.
One advantage of this method is that a central problem of genome assembly, the formation of wrong contigs due to the abundance of repeat sequences, is minimized: each assembly yields small, dispersed (i.e. non-adjacent) contigs of 1-10 kb within a 1-5 Mbp fraction of the genome rather than within the whole genome. Contigs that are much longer than this can be flagged at an early stage as the product of false joining and discarded. A further advantage is that assembly is computationally less complex, because the initial assembly (step 3) involves fewer sequences than an assembly of the entire genome in one step. A further advantage is that the selective amplification renders the process scalable to genomes of any size and universally applicable.
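The early flagging of over-long contigs could be implemented as a simple length filter, sketched below. The cut-off is an assumption made for illustration: the text only says that contigs "much larger" than the expected 1-10 kb are suspect, so the `tolerance` factor and the function name are hypothetical.

```python
def split_by_expected_length(contig_lengths, max_expected_bp=10_000, tolerance=2.0):
    """Separate contigs into plausible ones and suspected false joins.

    A contig is flagged when it is longer than max_expected_bp * tolerance.
    Both parameters are illustrative and would be tuned per enzyme combination.
    """
    limit = max_expected_bp * tolerance
    kept = [n for n in contig_lengths if n <= limit]
    flagged = [n for n in contig_lengths if n > limit]
    return kept, flagged


# Toy contig lengths in bp, not data from the method.
kept, flagged = split_by_expected_length([1200, 3500, 9800, 45000])
print(kept, flagged)
```

Flagged contigs would be set aside before the cross-EC assembly of step 5 rather than silently deleted, so that they can be re-examined if coverage turns out to be low.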
Step 1: as above, with a 6-cutter (EcoRI) and a 4-cutter (MseI). The average fragment length is about 250 bp. The A-B fragments represent about 8-15% of the genome. Compared with digestion using two 6-cutter restriction enzymes, on average about one selective nucleotide fewer is needed to arrive at a sequence complexity per PC of about 1-5 Mbp.
Step 2: as above. To avoid bias of too short fragments, a size selection can be used to remove fragments below 100-150 bp.
Step 3: Assembly of the sequences per sequence library (per PC)
Assembly is performed to generate contigs of all A-B fragments that have been amplified per PC. This leads to several thousand contigs per PC, of which the average length will correspond to the length of the A-B fragment (250 bp). The sequencing of all PCs results in a number of contigs that varies from several tens of thousands up to about a million (Arabidopsis 64000, Maize 1000000).
Step 4: repeat steps 1-3 with a variety of ECs (A-C, B-C, C-C, C-D, A-D etc.). This is necessary as the PCs of enzyme combination A-B do not cover more than 8-15% of the genome and, as above, the contigs of the PCs do not overlap.
Step 5: as above.
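The fragment lengths and fragment-type percentages quoted for the two-enzyme digests can be approximated with a simple model, sketched below. It assumes a random sequence with equal base frequencies and independently placed recognition sites, which real genomes only approximate (hence the dependence on GC content noted above). The function names are illustrative.

```python
def expected_site_spacing(site_length: int) -> int:
    """Expected distance in bp between sites of an n-cutter in a random sequence."""
    return 4 ** site_length


def fragment_type_fractions(len_a: int, len_b: int) -> dict:
    """Expected fractions of A-A, B-B and A-B fragments for enzymes A and B.

    Each fragment end is an A site with probability p_a, proportional to the
    site frequency of A, and a B site otherwise.
    """
    rate_a = 1 / 4 ** len_a
    rate_b = 1 / 4 ** len_b
    p_a = rate_a / (rate_a + rate_b)
    p_b = 1 - p_a
    return {"A-A": p_a * p_a, "B-B": p_b * p_b, "A-B": 2 * p_a * p_b}


print(expected_site_spacing(6), expected_site_spacing(4))  # 6-cutter vs 4-cutter
print(fragment_type_fractions(6, 6))  # two 6-cutters
print(fragment_type_fractions(6, 4))  # 6-cutter plus 4-cutter
```

Under these assumptions two 6-cutters give the 25% / 25% / 50% split, and a 6-cutter combined with a 4-cutter gives an A-B fraction of roughly 11%, which lies within the 8-15% range stated for that combination.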
Step 1: Digest the DNA with one restriction endonuclease A (EcoRI for example). This yields restriction fragments of about 3-4 kb, depending on GC content and choice of enzyme. Ligate adaptors to the digestion mix and perform long-range PCR (see above) with selective primers that reduce the amount of sequence per PC to about 1 Mb. For X=2 and Y=3, repeat for all 1024 PCs; for X=1 and Y=2, repeat for all 64 PCs.
Step 2: construction of libraries by shearing of each set of amplified adaptor-ligated restriction fragments and sequencing using the emulsion PCR in combination with pyrosequencing of 454 Technologies as described herein elsewhere, for each primer combination (1024 or 64 times). The sequencing using this technology provides for 40 Mbp sequence data per library, meaning that every library is 20-fold redundantly sequenced.
Step 3: Assembly of the sequences per sequence library (per PC)
Assembly is performed to generate contigs of all A-A fragments that have been amplified per PC. This leads, theoretically, to 600-900 contigs per PC, of which the average length will correspond to the length of the A-A fragment (3000 bp). The sequencing of all PCs results in a number of contigs that varies from several tens of thousands up to hundreds of thousands (Arabidopsis 42000, Maize 900000).
Step 4: repeat steps 1-3 with at least one other EC (B-B). This is necessary as the PCs of enzyme combination A-B do not cover more than 8-15% of the genome and, as above, the contigs of the PCs do not overlap.
Step 5: as above.
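The single-enzyme digest with selective amplification can be imitated in silico as a minimal sketch. Several simplifications are assumed and are not taken from the text: fragments are taken as the sequence between recognition sites (the exact cut position within the site is ignored), the selective nucleotides are matched against the first bases inside each fragment end, and, because both ends carry the same adaptor, either orientation of the +X/+Y primer pair is accepted. The toy sequence and function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq[::-1].translate(COMPLEMENT)


def digest(genome: str, site: str) -> list:
    """Return the internal fragments flanked by a recognition site on both sides."""
    pieces = genome.split(site)
    return pieces[1:-1]  # the two outer pieces have a site on one side only


def is_amplified(fragment: str, sel_x: str, sel_y: str) -> bool:
    """True if the selective bases match both fragment ends, in either orientation."""
    forward = fragment.startswith(sel_x) and revcomp(fragment).startswith(sel_y)
    reverse = fragment.startswith(sel_y) and revcomp(fragment).startswith(sel_x)
    return forward or reverse


ECORI = "GAATTC"
toy_genome = "TT" + ECORI + "ACGTTTGCAG" + ECORI + "CCATAGGTTA" + ECORI + "AA"

fragments = digest(toy_genome, ECORI)
subset = [f for f in fragments if is_amplified(f, "AC", "CTG")]  # a +2/+3 PC
print(fragments)
print(subset)
```

Running the selection for every possible pair of selective sequences partitions the fragments into the reproducible subsets that are sequenced and assembled one PC at a time.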
This example describes the ability to use high throughput sequencing of AFLP fragments derived from 2 restriction enzyme combinations to determine the genome sequence of a complex plant genome.
The following steps were taken in this example:
A) In silico prediction of AFLP restriction fragments of the Arabidopsis genome sequence (GenBank), using the software tool RECOMB, described in WO0044937 (Keygene N.V.).
The entire genome sequence of Arabidopsis (ecotype Columbia) was downloaded from GenBank. In silico AFLP +1/+1 fragments for the restriction enzyme combination BamHI/XbaI, using +C and +G selective nucleotides respectively, were predicted using RECOMB. Similarly, AFLP +1/+2 fragments for the restriction enzyme combination EcoRI/HindIII, using selective nucleotides +C and +CT, were predicted. The collection of AFLP fragments derived from the two in silico digests contained several (approximately 14) overlapping AFLP fragment sequences between the enzyme combinations BamHI/XbaI and EcoRI/HindIII. One set of overlapping restriction fragments forms a contig denoted contig “606”, which has a total length of 662 bp. The sequence of this contig is shown in
The predicted EcoRI/HindIII AFLP +C/+CT fragment in this contig is 218 bp in length, accounting for 32.9% of the total contig length of 662 bp. The predicted BamHI/XbaI AFLP +C/+G fragment is 486 bp long, equaling 73.4% of the total contig length. The two fragments overlap by 42 base pairs as depicted in
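The figures given for contig 606 can be checked with a short calculation, and the way two overlapping fragments are joined into one contig can be illustrated with a toy suffix-prefix merge. This is a minimal sketch: the merge function is a generic exact-overlap join, not the RECOMB tool or the assembler used in this example, and the short test sequences are invented.

```python
def merged_length(len_a: int, len_b: int, overlap: int) -> int:
    """Length of a contig built from two fragments sharing `overlap` bases."""
    return len_a + len_b - overlap


def merge_by_overlap(left: str, right: str, min_overlap: int = 1):
    """Join `left` and `right` at the longest suffix of `left` equal to a prefix of `right`.

    Returns None when no exact overlap of at least `min_overlap` bases exists.
    """
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None


# Fragment sizes and overlap as stated for contig 606.
total = merged_length(218, 486, 42)
print(total)
print(round(100 * 218 / total, 1), round(100 * 486 / total, 1))

# Toy sequences showing the join itself.
print(merge_by_overlap("ACGTTGCA", "TGCAGGTT", min_overlap=3))
```

The arithmetic returns 662 bp, with the two fragments contributing 32.9% and 73.4% of the contig, in agreement with the stated total length and overlap.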
B) AFLP Template Preparation and Amplification
Genomic DNA of the Arabidopsis ecotype Columbia and AFLP templates for the restriction enzyme combinations EcoRI/HindIII and BamHI/XbaI were prepared based on the protocols described by Zabeau & Vos (1993: Selective restriction fragment amplification; a general method for DNA fingerprinting. EP 0534858-A1, B1; U.S. Pat. No. 6,045,994) and Vos et al. (Vos, P., Hogers, R., Bleeker, M., Reijans, M., van de Lee, T., Hornes, M., Frijters, A., Pot, J., Peleman, J., Kuiper, M. et al. (1995) AFLP: a new technique for DNA fingerprinting. Nucl. Acids Res., 23(21), 4407-4414).
The following adaptor sequences (5′-3′) were used:
Selective (+1/+1) amplifications (E/H and B/X) were carried out using the following phosphorothioate primers (5′-3′):
with “s” denoting the position of phosphorothioate bonds in the oligonucleotides.
AFLP reaction mixtures had the following composition:
5 µl AFLP template, diluted 1/10 in MQ
10 µl 5× Herculase II PCR buffer
0.5 µl dNTPs (20 mM)
1.5 µl AFLP primer 1 (50 ng/µl)
1.5 µl AFLP primer 2 (50 ng/µl)
1 µl Herculase® II Fusion DNA polymerase
30.5 µl MQ
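The listed component volumes add up to a 50 µl reaction, and the same table can be scaled to a master mix for several reactions, as sketched below. The `excess` allowance for pipetting loss is an assumption added for illustration and is not specified in the protocol; the dictionary keys simply restate the components above.

```python
# Volumes in microlitres per single reaction, as listed above.
REACTION_MIX_UL = {
    "AFLP template (1/10 in MQ)": 5.0,
    "5x Herculase II PCR buffer": 10.0,
    "dNTPs (20 mM)": 0.5,
    "AFLP primer 1 (50 ng/ul)": 1.5,
    "AFLP primer 2 (50 ng/ul)": 1.5,
    "Herculase II Fusion DNA polymerase": 1.0,
    "MQ water": 30.5,
}


def total_volume(mix: dict) -> float:
    """Total volume of one reaction in microlitres."""
    return sum(mix.values())


def scale_mix(mix: dict, n_reactions: int, excess: float = 0.1) -> dict:
    """Volumes for n reactions plus a fractional excess (hypothetical allowance)."""
    factor = n_reactions * (1 + excess)
    return {name: volume * factor for name, volume in mix.items()}


print(total_volume(REACTION_MIX_UL))
for name, volume in scale_mix(REACTION_MIX_UL, n_reactions=2).items():
    print(f"{volume:6.2f} ul  {name}")
```

In practice the template and the primers differ between the two enzyme combinations, so only the components shared by all reactions would be pooled in a master mix.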
PCR cycling conditions were as follows:
Following AFLP amplification, reaction products were purified using Qiagen columns following the manufacturer's protocols.
C) 454 sequence library preparation.
Two 454 sequence libraries were prepared using purified BamHI/XbaI AFLP fragments and EcoRI/HindIII AFLP fragments, respectively, as starting DNA, as described by Margulies and co-workers, starting with nebulization (fragmentation) of the purified AFLP reaction products. A single 454 sequence run was performed on the GS20 sequencing instrument (Roche Molecular Diagnostics), applying each of the fragment libraries of the two AFLP enzyme combinations to one half of a GS20 PicoTiterPlate.
D) Data Processing
After completion of the sequence run, raw data were processed using the RUNASSEMBLY software of the GS20. Data resulting from the EcoRI/HindIII and BamHI/XbaI AFLP fragment libraries were processed separately and in combination, generating contigs of overlapping sequence reads.
Next, contigs resulting from RUNASSEMBLY were mapped against the reference sequence (contig 606, predicted in step A above) using RUNMAPPING, in order to determine to what extent the in silico predicted BamHI/XbaI +C/+G and EcoRI/HindIII +C/+CT AFLP fragments contained in contig 606 were sequenced. Coverage percentages obtained from the respective libraries are shown in the table below.
The resulting sequence contigs are shown in
These results demonstrate the feasibility of determining the genome sequence of complex plant genomes by digesting total genomic DNA with multiple AFLP restriction enzyme combinations, followed by contig assembly per fragment library and subsequent assembly of the contigs into the plant genome sequence.
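The coverage percentages referred to in step D express which part of a reference sequence is spanned by at least one mapped contig. A minimal sketch of that calculation is given below; it is a generic interval-union computation and not the RUNMAPPING implementation. The interval convention (0-based start, exclusive end) and the example coordinates are assumptions chosen for illustration, not results of this example.

```python
def coverage_percent(ref_length: int, intervals) -> float:
    """Percentage of reference positions covered by at least one mapped interval.

    Intervals are (start, end) pairs, 0-based with exclusive end; overlapping
    intervals are counted once and positions outside the reference are ignored.
    """
    covered = 0
    covered_up_to = 0  # rightmost reference position already counted
    for start, end in sorted(intervals):
        start = max(start, covered_up_to, 0)
        end = min(end, ref_length)
        if end > start:
            covered += end - start
            covered_up_to = end
    return 100.0 * covered / ref_length


# Hypothetical mapping coordinates on a 100 bp reference.
print(coverage_percent(100, [(0, 50), (25, 75)]))
print(coverage_percent(100, [(10, 20), (12, 18)]))
```

Computing this value separately for each library and for the combined data would show how much each enzyme combination contributes to covering the reference.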
| Number | Date | Country | Kind |
|---|---|---|---|
| 06075104.7 | Jan 2006 | EP | regional |

| Filing Document | Filing Date | Country | Kind | 371c Date |
|---|---|---|---|---|
| PCT/NL06/00312 | 6/23/2006 | WO | 00 | 7/14/2008 |

| Number | Date | Country |
|---|---|---|
| 60693053 | Jun 2005 | US |
| 60714897 | Sep 2005 | US |
| 60759034 | Jan 2006 | US |